facebookincubator / oomd

A userspace out-of-memory killer
GNU General Public License v2.0

doesn't kill cgroup, unable to set xattr trusted.oomd_ooms=1 #122

Open nartes opened 4 years ago

nartes commented 4 years ago

Description: oomd picks a cgroup to kill, but the kill fails.

Package: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=oomd

Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/util/Fs.cpp:576] Unable to set xattr trusted.oomd_ooms=1 on /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gnome-launched-firefox-11870.scope. errno=30
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/plugins/BaseKillPlugin.cpp:96] Trying to kill /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gnome-launched-firefox-11870.scope
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/plugins/KillMemoryGrowth-inl.h:168] Picked "user.slice/user-1000.slice/user@1000.service/gnome-launched-firefox-11870.scope" (2040MB) based on size > 10% of total 6989MB (size threshold overridden)
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/util/Fs.cpp:576] Unable to set xattr trusted.oomd_kill=0 on /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service. errno=30
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/plugins/BaseKillPlugin.cpp:141] Killed 0: 1377(ssh-agent)[E1] 1401(tmux: server)[E1] 1402(zsh)[E1] 1427(zsh)[E1] 1454(vim)[E1] 1455(zsh)[E1] 1485(htop)[E1] 1496(zsh)[E1] 1521(zsh)[E1] 46339(zsh)[E1] 4>
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/util/Fs.cpp:576] Unable to set xattr trusted.oomd_ooms=1 on /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service. errno=30
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/plugins/BaseKillPlugin.cpp:96] Trying to kill /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/plugins/KillMemoryGrowth-inl.h:168] Picked "user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service" (2370MB) based on size > 10% of total 6989MB (size threshold overridden)
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/OomdContext.cpp:163]   io_cost_cumulative=0 io_cost_rate=0
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/OomdContext.cpp:156]   mem=8MB mem_avg=7MB mem_low=0MB mem_min=0MB mem_prot=0MB anon=6MB swap_usage=0MB
Mar 06 17:53:21 MACHINE_NAME oomd[69346]: [../src/oomd/OomdContext.cpp:151]   pressure=0:0:0-0:0:0

danobi commented 4 years ago

In

1377(ssh-agent)[E1] 1401(tmux: server)[E1] 1402(zsh)[E1]

E1 means kill(pid, SIGKILL) failed with EPERM (errno 1). Is oomd running with the right permissions?
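One way to check the permission question without killing anything is to send signal 0, which runs the kernel's existence and permission checks but delivers no signal. Below is a minimal Python sketch; the helper name `probe_signal_permission` is hypothetical and not part of oomd. For the result to be meaningful, it would need to run as the same user and in the same service context as oomd.

```python
import errno
import os


def probe_signal_permission(pid: int) -> str:
    """Return "ok" if this process may signal pid, else the errno name."""
    try:
        # Signal 0 performs the permission checks but delivers nothing.
        os.kill(pid, 0)
    except OSError as exc:
        # For example EPERM (not permitted) or ESRCH (no such process).
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    # Probing our own PID should always be allowed.
    print(probe_signal_permission(os.getpid()))
```

A result of EPERM for one of the PIDs listed in the log would match the E1 marker; "ok" would suggest the failure depends on the environment the oomd service runs in.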

In

Unable to set xattr trusted.oomd_kill=0 on /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service. errno=30

errno=30 means setxattr failed with EROFS (read-only file system).
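The two error codes above can be decoded with Python's standard `errno` module. This is a small sketch, assuming a Linux host, since errno numbers are platform-specific; `describe_errno` is a hypothetical helper name.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Format an errno number as 'number NAME: message'."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 1 is the "E1" from the kill log line, 30 is the setxattr errno.
    for code in (1, 30):
        print(describe_errno(code))
```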

Are you using a hybrid cgroup1 + cgroup2 setup? May be unrelated but would be good to know.
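Both the hybrid-setup question and the EROFS error can be checked from the mount table. The sketch below parses text in the `/proc/mounts` format and reports each cgroup mount with its read-only flag. The names `CgroupMount`, `parse_cgroup_mounts` and `is_hybrid` are hypothetical, and a read-only mount is only one possible cause of EROFS. To see the mounts as oomd sees them, read `/proc/<oomd pid>/mounts` instead, because mount namespaces can differ between processes.

```python
from typing import List, NamedTuple


class CgroupMount(NamedTuple):
    mountpoint: str
    fstype: str       # "cgroup" (v1) or "cgroup2" (v2)
    read_only: bool


def parse_cgroup_mounts(mounts_text: str) -> List[CgroupMount]:
    """Extract cgroup v1/v2 mounts from /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        # Fields: device mountpoint fstype options dump pass
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("cgroup", "cgroup2"):
            continue
        options = fields[3].split(",")
        found.append(CgroupMount(fields[1], fields[2], "ro" in options))
    return found


def is_hybrid(mounts: List[CgroupMount]) -> bool:
    """True when both cgroup v1 and cgroup v2 hierarchies are mounted."""
    return {m.fstype for m in mounts} == {"cgroup", "cgroup2"}


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        mounts = parse_cgroup_mounts(f.read())
    for mount in mounts:
        print(mount)
    print("hybrid:", is_hybrid(mounts))
```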

nartes commented 4 years ago

@danobi perhaps it is an issue with the cgroup permissions setup. Could you suggest some bash commands to debug the kill procedure? I haven't read the source yet, but I thought about just hacking it with system('kill -9 %d', process_pid) in Fs.cpp instead of the cryptic trusted.oomd_kill = 0 attributes. What is this attribute, is it related to a facebook contributed kernel module? I couldn't find any documentation on the kill procedure oomd uses, so it is puzzling me at the moment.

P.S.

  1. cgroup configuration on Arch Linux: https://aur.archlinux.org/cgit/aur.git/commit/?h=oomd&id=3a6dcdb577bfa3c874894889315f0c940174bf73
  2. Some kernel parameters in the PKGBUILD: https://aur.archlinux.org/cgit/aur.git/commit/?h=oomd&id=3a6dcdb577bfa3c874894889315f0c940174bf73

P.P.S.

systemctl status oomd
● oomd.service - userspace out-of-memory killer
     Loaded: loaded (/usr/lib/systemd/system/oomd.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2020-03-08 22:35:50 +03; 24h ago
    Process: 584 ExecStartPre=/usr/bin/oomd --check-config ${OOMD_CONFIG} (code=exited, status=0/SUCCESS)
   Main PID: 594 (oomd)
      Tasks: 3 (limit: 9336)
     Memory: 2.6M (low: 64.0M)
        CPU: 9min 34.582s
     CGroup: /system.slice/oomd.service
             └─594 /usr/bin/oomd --config /etc/oomd.json --interval 5

Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/OomdContext.cpp:156]   mem=11MB mem_avg=11MB mem_low=0MB mem_min=0MB mem_prot=0MB anon=6MB swap_usage=0MB
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/OomdContext.cpp:163]   io_cost_cumulative=0 io_cost_rate=0
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/OomdContext.cpp:150] name=user.slice/user-1000.slice/user@1000.service/gsd-media-keys.service
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/OomdContext.cpp:151]   pressure=0:0:0-0:0:0
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/OomdContext.cpp:156]   mem=8MB mem_avg=8MB mem_low=0MB mem_min=0MB mem_prot=0MB anon=6MB swap_usage=0MB
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/OomdContext.cpp:163]   io_cost_cumulative=0 io_cost_rate=0
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/plugins/KillMemoryGrowth-inl.h:168] Picked "user.slice/user-1000.slice/user@1000.service/gnome-launched-firefox-29108.scope" (2519MB) based on size > 10% of tot>
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/plugins/BaseKillPlugin.cpp:92] OOMD: In dry-run mode; would have tried to kill /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gnome-launched-firefo>
Mar 09 22:38:21 MACHINE_NAME oomd[594]: [../src/oomd/Log.cpp:114] 0.00 0.00 0.00 user.slice/user-1000.slice/user@1000.service/gnome-launched-firefox-29108.scope 2641997824 ruleset:[user session protection] detecto>
Mar 09 22:38:22 MACHINE_NAME oomd[594]: [../src/oomd/engine/Ruleset.cpp:134] Action=kill_by_memory_size_or_growth returned STOP. Terminating action chain.

P.P.P.S. The /usr/bin/oomd process is running as root.

P.P.P.P.S.

yay -Qs cgroup
local/libcgroup 0.41-2
    Library that abstracts the control group file system in Linux

danobi commented 4 years ago

This is the kill code: https://github.com/facebookincubator/oomd/blob/master/src/oomd/plugins/BaseKillPlugin.cpp#L138

> cryptic trusted.oomd_kill = 0 attributes. What is this attribute, is it related to a facebook contributed kernel module?

It's an extended attribute; see man 7 xattr for details. oomd sets it so that delegated cgroup subtrees can tell when a kill was performed.
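To reproduce the failing setxattr call outside oomd, the same system call can be made from Python. This is a minimal sketch, not oomd's implementation. `try_set_xattr`, the attribute name `trusted.oomd_test` and the target path are assumptions chosen for illustration. `os.setxattr` is Linux-only, and writing to the `trusted.` namespace normally requires root (CAP_SYS_ADMIN).

```python
import errno
import os


def try_set_xattr(path: str, name: str, value: bytes) -> str:
    """Try to set an extended attribute; return "ok" or the errno name."""
    try:
        os.setxattr(path, name, value)  # Linux-only wrapper for setxattr(2)
    except OSError as exc:
        # EROFS here would match the errno=30 seen in the oomd log.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    # Hypothetical target and attribute name, used only for this test.
    cgroup_dir = "/sys/fs/cgroup/user.slice"
    print(try_set_xattr(cgroup_dir, "trusted.oomd_test", b"1"))
```

If this prints EROFS when run as root from the oomd service context, the cgroup filesystem is read-only from that context, and the problem is not in how oomd makes the call.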

Your systemctl status oomd output shows dry-run mode is on. With dry-run mode enabled for the plugin, the earlier log messages could not have been printed. Are you sure both sets of logs come from the same setup?

danobi commented 4 years ago

Can you also share the oomd config you're using?

nartes commented 4 years ago

It is in dry-run mode now, but the problem above was reported with dry-run off. I enabled dry-run mode to debug the kill selection.

//
// Basic configuration for a desktop linux machine
//

{
    "rulesets": [
        {
            "name": "user session protection",
            "detectors": [
                [
                    "user pressure above 60 for 30s",
                    {
                        "name": "memory_above",
                        "args": {
                            "cgroup": "user.slice",
                            "threshold": "80%",
                            "duration": "1"
                        }
                    }
                ]
            ],
            "actions": [
                {
                    "name": "kill_by_memory_size_or_growth",
                    "args": {
                        "cgroup": "user.slice/user-*.slice/user@*.service/*",
                        "size_threshold": 10,
                        "post_action_delay": 1,
                        "dry": true
                    }
                }
            ]
        }
    ]
}
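Since the thread hinges on whether dry-run was enabled, here is a small Python sketch that lists which actions in a config like the one above have "dry" set. It strips whole-line // comments first, because Python's json module rejects them. The helper names `load_commented_json` and `dry_run_actions` are hypothetical, and the sketch assumes the layout shown above (rulesets, then actions, then args).

```python
import json
from typing import List, Tuple


def load_commented_json(text: str) -> dict:
    """Parse JSON after dropping lines that contain only a // comment."""
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


def dry_run_actions(config: dict) -> List[Tuple[str, str]]:
    """Return (ruleset name, action name) pairs whose args enable "dry"."""
    result = []
    for ruleset in config.get("rulesets", []):
        for action in ruleset.get("actions", []):
            dry = action.get("args", {}).get("dry", False)
            # Accept both the boolean true and the string "true".
            if str(dry).lower() == "true":
                result.append((ruleset.get("name", ""), action.get("name", "")))
    return result


if __name__ == "__main__":
    # /etc/oomd.json is the path shown in the systemctl output above.
    with open("/etc/oomd.json") as f:
        for ruleset_name, action_name in dry_run_actions(load_commented_json(f.read())):
            print(f"dry-run: {ruleset_name} / {action_name}")
```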