dsd opened this issue 6 years ago
Was there a message or line number that came with the abort? I may have to rethink the async logging a bit, as log messages may not be making it out of the queue in time before the crash.
No, there wasn't anything printed apart from what was pasted.
Sorry about the spotty response time. I'm going to commit a few patches that enable inline logging (ie not async), so we can debug this better. I think we're crashing before the logging thread can flush everything.
BTW I could not reproduce on my machine, but that doesn't surprise me. (Grumble grumble something about early stage projects)
Could you update your build to fe8363ff02e and then try:
$ rm -rf build
$ CPPFLAGS=-DINLINE_LOGGING meson build
$ ninja -C build
and get it to crash again?
I think we only explicitly generate SIGABRTs in the codebase. There should always be a log message before a SIGABRT. Hopefully this patch gets us some more info.
I changed my mind on the compile option. It's an env var now (bfd75afa30b9):
INLINE_LOGGING=1 sudo -E ./oomd_bin <blah>
Hello! I work with @dsd and will be following up on his previous comments here. Thanks for the inline logging fix, I can now see error messages when oomd fails and trace back to the failing code.
The first error we were getting was "Unable to open /sys/fs/cgroup/system.slice/cgroup.subtree_control", which happens because on Debian systems the cgroups2 hierarchy is mounted on /sys/fs/cgroup/unified instead of /sys/fs/cgroup. I was able to work around the problem by changing the target in /etc/oomd.json to unified/system.slice, but you may want to make this a bit more generic. One idea would be to detect at run-time where the cgroups2 hierarchy is mounted.
Another problem I ran into was "FATAL: cgroup memory controller not enabled on /sys/fs/cgroup/unified/system.slice", because all cgroup controllers were bound to the cgroups v1 hierarchy mounted by default on Debian systems. Passing cgroup_no_v1=all on the kernel command line and then manually binding the memory controller to the cgroups2 hierarchy worked around that problem.
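For anyone else hitting this: the manual binding boils down to writing +memory into cgroup.subtree_control at the cgroup2 root. A hedged sketch (the function name and error handling are mine; the cgroup2 interface files are standard, and the root path follows this Debian setup):

```shell
# Enable the memory controller for child cgroups under a cgroup2 root.
# cgroup.controllers lists the controllers available at this level;
# writing "+memory" to cgroup.subtree_control delegates it to children.
enable_memory_controller() {
    root=$1
    if grep -qw memory "$root/cgroup.controllers"; then
        echo '+memory' > "$root/cgroup.subtree_control"
    else
        echo "memory controller not available under $root" >&2
        return 1
    fi
}

# e.g. (as root): enable_memory_controller /sys/fs/cgroup/unified
```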
Finally, most general-purpose distros do not enable CONFIG_MEMCG_SWAP_ENABLED since it increases memory consumption, so oomd failed with "Unable to open /sys/fs/cgroup/unified/system.slice/memory.swap.current". Enabling swap accounting at boot with swapaccount=1 avoids the problem, and I now have oomd running.
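For reference, both kernel parameters mentioned in this thread can be made persistent on Debian via GRUB (the file and helper below assume a stock Debian install with GRUB; keep whatever options are already on the line):

```shell
# /etc/default/grub -- append the boot parameters to the kernel
# command line, keeping any existing options:
GRUB_CMDLINE_LINUX_DEFAULT="swapaccount=1 cgroup_no_v1=all"

# then regenerate the grub config and reboot:
# sudo update-grub && sudo reboot
```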
It would be great to have these points fixed to make oomd more compatible with general-purpose distros. It would also be really nice to have more documentation on how to use it in such setups: for example, how to specify an overall memory-pressure threshold for the target cgroup (let's say user.slice) above which any process in that cgroup should be killed. Unless I have missed something, it is currently only possible to specify thresholds per subgroup of the target cgroup.
Thanks for removing the libfolly dependency that caused me trouble before.
Looking again now, I'm trying to get it working on Debian with systemd-239. The build and install went OK. On first launch of oomd_bin it segfaulted (with no other error that I could see), and some tracing with gdb indicated that it was because I need to put a config file in place, so I added /etc/oomd.json.
With that in place, it aborts with no logged error. Looking in the source code and with gdb, I decided that it was because the memory controller was not active. I set DefaultMemoryAccounting=yes in /etc/systemd/system.conf and rebooted. (If that's correct, maybe you can add it to the readme?) Now when I run it I get either this:
and for some reason gdb can't figure out the backtrace beyond __GI_abort. Any idea what I'm doing wrong?