containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

pod resource limits: error creating cgroup path: subtree_control: ENOENT #15074

Open lsm5 opened 2 years ago

lsm5 commented 2 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

aarch64 CI enablement in #14801 is experiencing failures in the system tests. This issue is a placeholder for tracking them and for use in FIXME comments for skip_if_aarch64.

edsantiago commented 2 years ago

pod resource limits test is in code that @cdoern just merged last week:

# # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
# Error: error creating cgroup path /libpod_parent/e0024c8b8ccc24c247b62a422433c0b69d7c3f930bad3863563fcec0d0db43f1: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
# [ rc=125 (** EXPECTED 0 **) ]
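In case it helps reasoning about the error: the write fails with ENOENT exactly when the libpod_parent directory is gone at write time. A minimal sketch of that failure mode against a scratch directory (paths are illustrative stand-ins; nothing here touches the real cgroupfs):

```shell
# Scratch directory playing the role of /sys/fs/cgroup.
d=$(mktemp -d)
mkdir "$d/libpod_parent"      # the manager creates the parent cgroup...
rmdir "$d/libpod_parent"      # ...and something else removes it before the write
if ! { echo "+cpu +memory" > "$d/libpod_parent/cgroup.subtree_control"; } 2>/dev/null; then
  echo "write failed: no such file or directory"
fi
rm -rf "$d"
```

On the real cgroupfs the same write can also fail for other reasons (controller not available, processes in the cgroup), so this only illustrates the ENOENT case.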
edsantiago commented 2 years ago

sdnotify test is systemd-related, so @vrothberg might be the best person to look at it, but it could also be crun, so ping @giuseppe too:

# # podman run -d --sdnotify=container quay.io/libpod/fedora:31 sh -c printenv NOTIFY_SOCKET;echo READY;systemd-notify --ready;while ! test -f /stop;do sleep 0.1;done
# 2ff76f9670f13c479196440ac93babe9fc4afa8cbb0e0b6799b73a3b59969292
# # podman logs 2ff76f9670f13c479196440ac93babe9fc4afa8cbb0e0b6799b73a3b59969292
# /run/notify/notify.sock
# READY
# Failed to notify init system: Permission denied

Lots more permission and SELinux errors make me strongly suspect that SELinux is broken on these systems. It might be that the only way to debug is to ssh into one of them.

edsantiago commented 2 years ago

@lsm5 hint for next time: file the issue first, then go to the broken PR and find links to all the failing logs, paste them in the issue, and then resubmit the PR with skips. It's almost impossible to find old Cirrus logs for a PR. (I scraped the above from comments I made in your PR, so no problem. Just something to keep in mind for next time!)

cdoern commented 2 years ago

pod resource limits test is in code that @cdoern just merged last week:

# # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
# Error: error creating cgroup path /libpod_parent/e0024c8b8ccc24c247b62a422433c0b69d7c3f930bad3863563fcec0d0db43f1: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
# [ rc=125 (** EXPECTED 0 **) ]

The only reason this should fail is if arm does not have subtree control, which I find highly unlikely. The subtree_control file is less related to my resource-limits work and more to cgroup creation in general. I know where this is done in containers/common, but still... an issue like this makes me think the kernel is missing some things when compiled.
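For context on why the write happens at all: in cgroup v2, a child cgroup can only use a controller if every ancestor lists it in its own cgroup.subtree_control, so the manager enables controllers top-down before creating the pod's cgroup. A sketch of that ordering against a scratch directory (file names mirror the kernel interface, but nothing here touches the real cgroup tree):

```shell
root=$(mktemp -d)                    # stand-in for /sys/fs/cgroup
mkdir -p "$root/libpod_parent"
echo "+cpu +memory" > "$root/cgroup.subtree_control"                 # enable on the root first
echo "+cpu +memory" > "$root/libpod_parent/cgroup.subtree_control"   # then on libpod_parent
cat "$root/libpod_parent/cgroup.subtree_control"
rm -rf "$root"
```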

giuseppe commented 2 years ago

The only reason this should fail is if arm does not have subtree control, which I find highly unlikely. The subtree_control file is less related to my resource-limits work and more to cgroup creation in general. I know where this is done in containers/common, but still... an issue like this makes me think the kernel is missing some things when compiled.

could also be libpod_parent/ missing

cdoern commented 2 years ago

True, @giuseppe, but libpod_parent is created (if it does not exist) before subtree_control is written, I believe?

giuseppe commented 2 years ago

then /sys/fs/cgroup might not be a cgroup v2 mount
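That is easy to rule out: on a unified hierarchy, `stat -fc %T /sys/fs/cgroup` reports cgroup2fs. The check itself is trivial; shown below as a hypothetical helper over the fstype string (the helper name is illustrative, not from podman):

```shell
# Classify the output of: stat -fc %T /sys/fs/cgroup
is_cgroup_v2() { [ "$1" = "cgroup2fs" ]; }

is_cgroup_v2 cgroup2fs && echo "unified (v2)"
is_cgroup_v2 tmpfs     || echo "not v2 (hybrid or v1 mount)"
```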

edsantiago commented 2 years ago

It's v2. I'm doing the Cirrus rerun-with-terminal thing and trying to reproduce it, and can't: hack/bats 200:resource passes, as does manually recreating the fallocate, losetup, echo bfq, and podman pod create commands. This could be something context-sensitive, where a prior test sets the system up in such a way that it causes this test to fail.
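For anyone else trying to reproduce: the test's device setup is roughly the following sketch (sizes and the loop-device name are illustrative; losetup and the scheduler write need root, so they are left commented out here):

```shell
img=$(mktemp)
fallocate -l 1M "$img" 2>/dev/null || truncate -s 1M "$img"  # backing file for the loop device
stat -c %s "$img"                                            # size in bytes
# losetup /dev/loop0 "$img"                      # attach (root only)
# echo bfq > /sys/block/loop0/queue/scheduler    # bfq scheduler, needed for blkio weights
rm -f "$img"
```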

edsantiago commented 2 years ago

Still failing, but @lsm5 believes it might be a flake (which is consistent with my findings in the rerun terminal). I don't know if that's better or worse.

edsantiago commented 2 years ago

I'll be darned. It is a flake.

edsantiago commented 2 years ago

@cdoern @giuseppe please use @cevich's #15145 to spin up VMs and debug this.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 2 years ago

pod resource limits still flaking

edsantiago commented 1 year ago

Still happening on f38:

[+1177s] not ok 317 pod resource limits
...
<+008ms> # # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight=50
<+209ms> # Error: creating cgroup path /libpod_parent/9f84a4a2767e6495567aaf02a54447213083db7484d539edae31add828221b45: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
edsantiago commented 1 year ago

Seen just now on my RH laptop:

✗ pod resource limits
...
   [05:05:24.431787056] # .../bin/podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight=50
   [05:05:24.528324789] Error: creating cgroup path /libpod_parent/09404b9d6c87cce725635b445cfc3b5bf0f5fb654dfece8a15296915e6d71871: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
   [05:05:24.541146057] [ rc=125 (** EXPECTED 0 **) ]

Passed on rerun. Again, this is my RH laptop, not aarch64.

edsantiago commented 2 months ago

Seen after a long absence: f40 root, in parallel system tests, though I doubt the parallelism has anything to do with it.

edsantiago commented 2 months ago

Ping, seeing this one often in parallel system tests.

sys(7)   podman(7)   fedora-40-aarch64(2)   root(7)   host(7)   sqlite(6)
                     rawhide(2)                                 boltdb(1)
                     fedora-40(2)
                     fedora-39(1)
edsantiago commented 2 months ago

Continuing to see this often in parallel system tests

sys(12)   podman(12)   fedora-40(5)           root(12)   host(12)   sqlite(8)
                       fedora-40-aarch64(3)                         boltdb(4)
                       rawhide(2)
                       fedora-39(2)
giuseppe commented 2 months ago

adding some code through https://github.com/containers/common/pull/2158 to help debug this issue