facebookincubator / oomd

A userspace out-of-memory killer
GNU General Public License v2.0
1.81k stars 143 forks

SIGABRT on launch, no backtrace #18

Open dsd opened 6 years ago

dsd commented 6 years ago

Thanks for removing the libfolly dependency that caused me trouble before.

Looking again now, I'm trying to get it working on Debian with systemd-239. The build and install went OK. On first launch, oomd_bin segfaulted (with no other error that I could see), and some tracing with gdb indicated that it was because I needed to put a config file in place, so I added /etc/oomd.json:

{
    "cgroups": [
        {
            "target": "system.slice",
            "oomdetector": "default",
            "oomkiller": "default"
        }
    ],
    "version": "0.2.0"
}

With that in place, it aborts with no logged error. Looking in the source code and with gdb, I decided that it was because the memory controller was not active. I set DefaultMemoryAccounting=yes in /etc/systemd/system.conf and rebooted. (If that's correct, maybe you can add it to the readme?)

Now when I run it I get this:

# oomd_bin -v
[../Main.cpp:112] oomd running with conf_file=/etc/oomd.json dry=0 verbose=1
[../Config.cpp:119] target_=/sys/fs/cgroup/system.slice
[../../oomd/util/Fs.h:119] Unable to open /etc/oomd_tunables.override
[../shared/Tunables.cpp:32] OOMD_INTERVAL=5
[../shared/Tunables.cpp:32] OOMD_VERBOSE_INTERVAL=300
[../shared/Tunables.cpp:32] OOMD_POST_KILL_DELAY=15
[../shared/Tunables.cpp:32] OOMD_THRESHOLD=60
[../shared/Tunables.cpp:32] OOMD_HIGH_THRESHOLD=80
[../shared/Tunables.cpp:32] OOMD_HIGH_THRESHOLD_DURATION=10
[../shared/Tunables.cpp:32] OOMD_LARGER_THAN=50
[../shared/Tunables.cpp:32] OOMD_GROWTH_ABOVE=80
[../shared/Tunables.cpp:32] OOMD_AVERAGE_SIZE_DECAY=4
[../shared/Tunables.cpp:32] OOMD_FAST_FALL_RATIO=0.85
[../shared/Tunables.cpp:32] OOMD_MIN_SWAP_PCT=15
[../shared/Tunables.cpp:32] OOMD_FBTAX2_WORKLOAD_THRESHOLD=0
Aborted (core dumped)

and for some reason gdb can't figure out the backtrace beyond __GI_abort.

Any idea what I'm doing wrong?

danobi commented 6 years ago

Was there a message or line number that came with the abort? I may have to rethink the async logging a bit, as log messages may not be making it out of the queue in time before the crash.

dsd commented 6 years ago

No, there wasn't anything printed apart from what was pasted.

danobi commented 6 years ago

Sorry about the spotty response time. I'm going to commit a few patches that enable inline logging (i.e., not async), so we can debug this better. I think we're crashing before the logging thread can flush everything.
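To illustrate the suspected failure mode, here is a minimal Python sketch of an async logger next to an inline one. It is not oomd's logging code; `SketchLogger`, its `sink` callable, and the `inline` flag are hypothetical names used only for illustration. In async mode a message sits in a queue until a worker thread writes it, so a process that aborts right after `log()` can lose it; in inline mode the message is written before `log()` returns.

```python
import queue
import threading


class SketchLogger:
    """Hypothetical logger for illustration; not oomd's actual class."""

    def __init__(self, sink, inline=False):
        self._sink = sink            # callable that writes one message
        self._inline = inline
        self._queue = queue.Queue()
        if not inline:
            # Daemon thread: it dies with the process, so anything still
            # queued when the process aborts is never written.
            worker = threading.Thread(target=self._drain, daemon=True)
            worker.start()

    def log(self, message):
        if self._inline:
            self._sink(message)       # written before log() returns
        else:
            self._queue.put(message)  # written later by the worker

    def flush(self):
        """Block until the worker has written every queued message."""
        self._queue.join()

    def _drain(self):
        while True:
            message = self._queue.get()
            self._sink(message)
            self._queue.task_done()
```

With `inline=False`, a fatal message is only guaranteed to reach the sink if `flush()` runs before the process dies, which is the gap inline logging closes.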

danobi commented 6 years ago

BTW I could not reproduce on my machine, but that doesn't surprise me. (Grumble grumble something about early stage projects)

danobi commented 6 years ago

21

danobi commented 6 years ago

Could you update your build to fe8363ff02e and then try:

$ rm -rf build
$ CPPFLAGS=-DINLINE_LOGGING meson build
$ ninja -C build

and get it to crash again?

I think we only explicitly generate SIGABRTs in the codebase. There should always be a log message before a SIGABRT. Hopefully this patch gets us some more info.

danobi commented 6 years ago

I changed my mind on the compile option. It's an env var now (bfd75afa30b9):

INLINE_LOGGING=1 sudo -E ./oomd_bin <blah>

jprvita commented 6 years ago

Hello! I work with @dsd and will be following up on his previous comments here. Thanks for the inline logging fix; I can now see error messages when oomd fails and trace back to the failing code.

The first error we were getting was "Unable to open /sys/fs/cgroup/system.slice/cgroup.subtree_control", which happens because on Debian systems the cgroups2 hierarchy is mounted on /sys/fs/cgroup/unified instead of /sys/fs/cgroup. I was able to work around the problem by changing the target in /etc/oomd.json to unified/system.slice, but you may want to make this a bit more generic. One idea would be to detect at run-time where the cgroups2 hierarchy is mounted.
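One way the run-time detection could look, as a rough Python sketch rather than a proposal for oomd's actual code: scan /proc/mounts for an entry whose filesystem type is cgroup2 and use its mount point. The function names `find_cgroup2_mount` and `read_proc_mounts` are hypothetical.

```python
def find_cgroup2_mount(mounts_text):
    """Return the first mount point whose filesystem type is cgroup2, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "cgroup2":
            return fields[1]
    return None


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's mount table as text."""
    with open(path) as f:
        return f.read()
```

Calling `find_cgroup2_mount(read_proc_mounts())` on the Debian layout described above should yield /sys/fs/cgroup/unified, which could then be prefixed to the configured target instead of a hard-coded /sys/fs/cgroup.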

Another problem I ran into was "FATAL: cgroup memory controller not enabled on /sys/fs/cgroup/unified/system.slice", because all cgroup controllers were bound to the cgroups v1 hierarchy mounted by default on Debian systems. Passing cgroup_no_v1=all to the kernel command line and then manually binding the memory controller to the cgroups2 hierarchy worked around that problem.
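A quick way to check this condition before starting oomd, sketched in Python: on the cgroup v2 hierarchy, each cgroup directory has a cgroup.controllers file listing the controllers available to it, separated by spaces. Whether oomd inspects this exact file is an assumption on my part, and `memory_controller_available` is a hypothetical helper name.

```python
import os


def memory_controller_available(cgroup_dir):
    """True if 'memory' is listed in <cgroup_dir>/cgroup.controllers."""
    path = os.path.join(cgroup_dir, "cgroup.controllers")
    try:
        with open(path) as f:
            return "memory" in f.read().split()
    except OSError:
        # Missing or unreadable file: treat the controller as unavailable.
        return False
```

For the setup above, the directory to pass would be /sys/fs/cgroup/unified/system.slice. A False result while the v1 hierarchy still owns the memory controller would match the FATAL message quoted above.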

Finally, most general-purpose distros do not enable CONFIG_MEMCG_SWAP_ENABLED since it increases memory consumption, so oomd failed with "Unable to open /sys/fs/cgroup/unified/system.slice/memory.swap.current". Enabling it at boot with swapaccount=1 on the kernel command line avoids the problem, and I now have oomd running.
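The two "Unable to open" failures in this comment could be caught up front with a small preflight check. The Python sketch below only tests for the two files named in the error messages above; the helper name `missing_interface_files` is hypothetical, and oomd may well read other files that are not listed here.

```python
import os

# Files taken from the error messages quoted in this thread; not a complete list.
REQUIRED_FILES = ("cgroup.subtree_control", "memory.swap.current")


def missing_interface_files(cgroup_dir, required=REQUIRED_FILES):
    """Return the names in `required` that do not exist under cgroup_dir."""
    return [
        name
        for name in required
        if not os.path.exists(os.path.join(cgroup_dir, name))
    ]
```

An empty list means both files are present; a list containing memory.swap.current would point at swap accounting being disabled.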

It would be great to have these points fixed to make oomd more compatible with general-purpose distros. Also, it would be really nice to have more documentation on how to use it in such setups, for example, how to specify an overall threshold of memory pressure for the target cgroup (let's say, user.slice) above which any process in that cgroup should be killed -- unless I have missed something, it is only possible to specify thresholds per subgroup of the target cgroup.