containers / conmon

An OCI container runtime monitor.
Apache License 2.0

Conmon breaks Podman run on Fedora 36 workstation for rootless users #350

Closed PavelSosin-320 closed 2 years ago

PavelSosin-320 commented 2 years ago

Conmon version: 2.1.0
Podman version: 4.1.1 (the latest available for Fedora 36)
OS: Fedora 36 Workstation

Command: podman run .... ubuntu. Error: [conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied. The issue is discussed in detail in the Stack Overflow question cannot-open-proc-self-oom-score-adj-when-i-have-the-right-capability. That topic doesn't provide a clear answer on how to avoid the issue, and it is security related, very sorry. Maybe for the Workstation's GUI session this line is not too critical and can be skipped? The GNOME session has its own OOM killer that is not easily accessible from conmon. Something in the style of "disable the OOM killer at the system level".

P.S. I looked into the code and found that an error writing oom_score_adj is silently ignored. I think it is impossible to ignore memory-management errors for a memory-consuming process like conmon, because that makes OOM-killer behavior unpredictable and the system unstable. It would be more correct to let it grow and record a memory watermark if it is killed by the OOM killer, or to disable the OOM killer altogether.

haircommander commented 2 years ago

This is expected and (mostly) harmless. Basically, conmon attempts to set its own OOM adjust score to -1000. This means it cannot be OOM killed, which is required because (in cgroup v1) the kernel can mistakenly kill conmon even if it is its container that is problematically consuming memory.

However, rootless users aren't permitted to lower their own OOM adjust score. The message you see is conmon notifying you that's the case. It's expected, and likely alright, though in very memory-contended environments, be aware that the kernel may kill conmon, leaving the container in an odd state.
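For reference, the mechanism described above comes down to a single procfs write at startup. Below is a minimal sketch of the idea, not conmon's actual source: attempt to lower the score, and treat failure as a warning rather than a fatal error.

```c
/* Minimal sketch of lowering a process's own OOM score; not conmon's
 * actual code. Lowering oom_score_adj below its current value requires
 * CAP_SYS_RESOURCE, so a rootless process hits the warning path. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void lower_oom_score(void)
{
    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd < 0) {
        fprintf(stderr, "open /proc/self/oom_score_adj: %s\n",
                strerror(errno));
        return;
    }
    /* Rootless: the kernel rejects the write with EACCES; warn and
     * continue without OOM protection instead of aborting. */
    if (write(fd, "-1000", 5) < 0)
        fprintf(stderr, "failed to write to /proc/self/oom_score_adj: %s\n",
                strerror(errno));
    close(fd);
}

int main(void)
{
    lower_oom_score();
    return 0;
}
```

Run as root, the write succeeds; run as an ordinary user, it prints the same "Permission denied" warning seen in this issue and carries on.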

PavelSosin-320 commented 2 years ago

But somehow, the container is not created. Podman on Fedora uses the cgroup v2 API managed by systemd. I'm afraid that this is exactly what happens and what is discussed in the Stack Overflow post: conmon and crun require some capabilities, so nodes under /proc/self can be managed only by root. Indeed, it works if the user is root. Podman also works inside a GNOME Boxes stateless VM running CoreOS, which has no OOM-killer service. Unfortunately, both Fedora WS and CoreOS are "rootless": they have no root user and group 😢. This is the new Red Hat security model. Admin/core can be sudoers, but the root user itself can't be created.

haircommander commented 2 years ago

Okay, so the root (pardon the pun) of the problem is that containers aren't coming up with rootless Podman? Did you get your issue resolved on Stack Overflow, or do you need more help? Did you read the Podman rootless debugging guide? https://github.com/containers/podman/blob/main/rootless.md

PavelSosin-320 commented 2 years ago

@haircommander The issue is not mentioned in the troubleshooting guide. This is purely a problem of the OOM killer in the proxy design pattern. As I mentioned, the post is not marked as SOLVED and hasn't offered any clear solution without compromising security since 2018. It is also related to capabilities inheritance from conmon to crun. Personally, I would prefer to sacrifice the container runtime (crun) at design time rather than conmon in the production environment. But that means OOM-killer behavior has to be managed separately, and the solution has to be harmonized with the cgroup structure defined by the cgroupns option.

haircommander commented 2 years ago

It's not clear to me what you're asking for. Are you requesting that conmon's OOM adjust score be set to -1000 in rootless mode, for fear of it being OOM killed?

> doesn't offer any clear solution without compromising security since 2018

I'm not sure what post you're referring to here.

FWIW, I believe cgroup v2 acts smarter in these situations and kills the container, though you may want to test that.
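The "smarter" cgroup v2 behavior plausibly refers to its group OOM handling: writing 1 to a cgroup's memory.oom.group file tells the kernel to kill every task in that cgroup together instead of picking a single victim. A hedged sketch; the cgroup path below is an example, and real container cgroup paths vary by cgroup manager:

```c
/* Sketch: opt a cgroup v2 cgroup into group OOM kills by writing "1"
 * to its memory.oom.group file. The path is an example only. */
#include <stdio.h>

int main(void)
{
    const char *path =
        "/sys/fs/cgroup/user.slice/example.scope/memory.oom.group";
    FILE *f = fopen(path, "w");

    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    /* With oom.group=1, an OOM kill in this cgroup takes down all of its
     * processes at once, so no half-dead container is left behind. */
    fputs("1\n", f);
    fclose(f);
    return 0;
}
```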

mheon commented 2 years ago

To be 100% clear here: the warning message displayed by Conmon in this case is a warning. It is not fatal and does not have any impact on the proper functioning of Podman and Conmon. If you are seeing any issues with Podman, they are unrelated to this message.

If there is any takeaway here, perhaps it should be a rewording of the error message to sound a little less important/fatal (at least, when Conmon is run without root privileges)?

PavelSosin-320 commented 2 years ago

@haircommander Somehow, when this message appears in the log, the container is never created, without anything else being written to the log, on my Podman version and OS. Regarding the consequences of a ghost container in the scenario where conmon is killed by the OOM killer: I think that when conmon gets a KILL signal, regardless of its source, it must attempt to collect statistics, disconnect from the OCI runtime (security), propagate KILL to the runtime, and release all resources to avoid leaks before exit. The "Permission denied" is generated by the kernel, not by conmon. If the error is absolutely harmless, it has to appear as info, not error, and not prevent container creation.

PavelSosin-320 commented 2 years ago

@haircommander Since SIGKILL interception is not possible, prctl may orchestrate the behavior of conmon as a proxy and crun as an executor: when conmon dies, crun must die too. This is proposed in the first post on Server Fault, but I don't see it in the code. If the proxy and executor always die together, I can ignore which process caused the OOM. So, again, writing to oom_score_adj for conmon alone is useless, because the runtime can't run as an orphan.
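The mechanism being proposed maps onto the kernel's parent-death signal. A hypothetical sketch of the pattern, not how conmon and crun are actually wired together:

```c
/* Sketch of tying a child's lifetime to its parent via PR_SET_PDEATHSIG.
 * Hypothetical illustration of the proposal above, not conmon/crun code. */
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();

    if (child == 0) {
        /* Ask the kernel to SIGKILL this child when its parent dies,
         * including when the parent is itself SIGKILLed by the OOM killer. */
        prctl(PR_SET_PDEATHSIG, SIGKILL);
        /* Close the race where the parent died before the prctl() call. */
        if (getppid() == 1)
            _exit(1);
        execlp("sleep", "sleep", "1000", (char *)NULL);
        _exit(127);
    }
    printf("parent %d, child %d: kill the parent and the child dies too\n",
           getpid(), (int)child);
    pause();
    return 0;
}
```

With this in place, the question of which process the OOM killer picked loses its sting, since neither process outlives the other.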

PavelSosin-320 commented 2 years ago

@haircommander Sorry, but when I see

```
DEBU[0001] running conmon: /usr/bin/conmon args="[--api-version 1 -c b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141 -u b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141 -r /usr/bin/crun -b /home/pavelsosin/.local/share/containers/storage/btrfs-containers/b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141/userdata -p /run/user/1000/containers/storage/btrfs-containers/b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141/userdata/pidfile -n unruffled_solomon --exit-dir /run/user/1000/libpod/tmp/exits --full-attach -s -l journald --log-level debug --syslog -t --conmon-pidfile /run/user/1000/containers/storage/btrfs-containers/b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/pavelsosin/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1000/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/user/1000/libpod/tmp --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg cni --exit-command-arg --volumepath --exit-command-arg /home/pavelsosin/.local/share/containers/storage/volumes --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg btrfs --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141]"
INFO[0001] Running conmon under slice user.slice and unitName libpod-conmon-b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141.scope
DEBU[0001] Received: -1
DEBU[0001] Cleaning up container b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141
DEBU[0001] Tearing down network namespace at /run/user/1000/netns/netns-0a64e6c8-5342-9a29-0be8-bd2f32f91e7f for container b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141
DEBU[0001] Unmounted container "b0375ed5120c3d6e0099611eb28f1767b3ffa2f8b61321303da483e99f032141"
DEBU[0001] ExitCode msg: "crun: [conmon:d]: failed to write to /proc/self/oom_score_adj: permission denied\n\nopen executable: permission denied: oci permission denied"
Error: crun: [conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied

open executable: Permission denied: OCI permission denied

[pavelsosin@Dell ~]$ id
uid=1000(pavelsosin) gid=1000(pavelsosin) groups=1000(pavelsosin),10(wheel),36(kvm) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
```

and then the container doesn't exist, I have to believe that conmon and crun implement the proxy pattern in a way that makes OOM adjustment impossible. Maybe because of caps inheritance, or because conmon and crun have to run as a group of processes. Possibly the resource-limit dependency issue is solved at the cgroup level via an annotation, but not at the OOM-killer level. Killing conmon alone due to a memory deficit looks not so catastrophic if the container continues to run and fill the log.

mheon commented 2 years ago

@PavelSosin-320 That is not an OOM issue. The actual issue is the open executable: Permission denied: OCI permission denied message. It is not related to the preceding [conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied message, which is a harmless warning. Your container is failing to start because the OCI runtime cannot exec the first process in the container.

It's very curious that message is working its way into the printed error in Podman, though. That seems like a bug? I'd expect we'd only be printing the last message. It's also prefixing the [conmon:d] with crun when I'm quite confident that message isn't coming from crun. Perhaps a result of --log-level=debug causing Conmon to emit messages after crun is started?

PavelSosin-320 commented 2 years ago

Without --log-level=debug the message also appears. Everything is very simple: choom -p <pid> -n -1000 gives me "Permission denied" for the running podman service. It means denied. Writing to the old oom_adj file required the process to be privileged; possibly that is still the case today. There are no virgin rootless processes in the GNOME environment: every process has some privileges to communicate with its environment. If that breaks conmon, the game is over. Podman has to carefully calibrate the caps passed to the OCI runtime and its proxy.
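The privilege in question is CAP_SYS_RESOURCE: the kernel only allows lowering oom_score_adj below its current value when that capability is in the effective set, which is why both choom and conmon's write get "Permission denied" for an unprivileged user. A small diagnostic sketch, assuming libcap is available (build with -lcap):

```c
/* Sketch: report whether this process holds CAP_SYS_RESOURCE, the
 * capability required to lower oom_score_adj. Build with -lcap. */
#include <stdio.h>
#include <sys/capability.h>

int main(void)
{
    cap_t caps = cap_get_proc();
    cap_flag_value_t has = CAP_CLEAR;

    if (caps == NULL) {
        perror("cap_get_proc");
        return 1;
    }
    cap_get_flag(caps, CAP_SYS_RESOURCE, CAP_EFFECTIVE, &has);
    printf("CAP_SYS_RESOURCE effective: %s\n",
           has == CAP_SET ? "yes (may lower oom_score_adj)"
                          : "no (the write fails with EACCES)");
    cap_free(caps);
    return 0;
}
```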

PavelSosin-320 commented 2 years ago

The assumption that "Permission denied" is triggered by the OCI runtime looks correct: I can do container create but can't container init even from the plain Ubuntu image from the Docker Hub library. Unfortunately, other standard images fail too. Just check: which SELinux label is expected for the container runtime's files in $HOME/containers/storage? Everything under HOME gets unconfined_u:object_r:user_home_dir_t. Possibly this is in the hands of the systemd-homed unit. If crun expects something else, a unit that prepares the container storage and runs after homed is necessary. homed has its reasons to prepare the home dir for maintenance using homectl; if a "container" label is expected, then a "post-homed" unit is needed to prepare the container storage. Simple Podman-breaking scenario: restart systemd-homed. The files in containers/storage then become unlabeled, and those under the runroot get labeled as tmp. In other words, files persist but SELinux labeling doesn't survive session logon / user switching. Is it curable?
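One way to check the labels in question is to read the security.selinux extended attribute directly. A minimal sketch; pointing it at files under $HOME/.local/share/containers/storage would show whether they kept their labels:

```c
/* Sketch: print a file's SELinux label by reading its security.selinux
 * xattr. ENODATA from lgetxattr means the file is unlabeled. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    char label[256];

    ssize_t n = lgetxattr(path, "security.selinux", label, sizeof(label) - 1);
    if (n < 0) {
        perror("lgetxattr");
        return 1;
    }
    label[n] = '\0';
    printf("%s: %s\n", path, label);
    return 0;
}
```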

PavelSosin-320 commented 2 years ago

@mheon Please pay attention that the /run filesystem type is tmpfs. In recent kernels tmpfs is an in-memory filesystem, and all restrictions relevant to writing/reading memory and executing code in memory apply to the Podman runroot. Segmentation faults, disk-full issues, etc. can be related to this fact. Indeed, podman init of a created container fails in the same way as podman run. Hopefully it doesn't execute anything from the container's image, because an "initialized" but not "started" container can't be stopped and its processes can run forever.
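If mount restrictions on the tmpfs runroot are the suspicion, one quick check is whether the mount carries the noexec flag. A small sketch, using the uid-1000 runroot path from the log above:

```c
/* Sketch: report whether a mount point is mounted noexec via statvfs. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    const char *path = "/run/user/1000";  /* runroot from the debug log */
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("%s is mounted %s\n", path,
           (vfs.f_flag & ST_NOEXEC) ? "noexec" : "exec");
    return 0;
}
```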

PavelSosin-320 commented 2 years ago

Why does Podman work perfectly, running the ubuntu image from the Docker Hub library inside a Fedora 36 CoreOS VM under GNOME Boxes, i.e. completely in the userspace of the core user (uid 1000), but not on Fedora WS under myuser (uid 1000)? The only obvious difference I see is the storage driver defined in the storage.conf file: it looks like Podman uses the default overlay driver on CoreOS and btrfs in the case of Fedora WS. Could somebody check that crun looks at storage.conf and really supports that storage configuration and btrfs?

PavelSosin-320 commented 2 years ago

OK, I know what it is: this is a very old btrfs bug (BTRFS bug, Red Hat Bugzilla) that first appeared with Docker on Fedora 26, was then closed due to EOL, and has been re-detected in subsequent versions. Dan Walsh is aware of the strange btrfs behavior. The files in the containers subvolumes eventually drop their SELinux labeling, and this causes a hidden SELinux rejection that is not visible in the monitor (???). The root directory of Podman's storage drops its label too, because it is inside the HOME filesystem. The kernel and btrfs versions have changed since Fedora 27, but the issue was not detected and corrected.

rhatdan commented 2 years ago

So not something Podman can fix. Closing.

PavelSosin-320 commented 2 years ago

@rhatdan I will happily accept it if the btrfs home FS can be converted to Stratis in place on an existing Fedora 36 machine, or if, after my return from vacation, Fedora 37 is released and Fedora 36 WS is upgradable to 37 with Podman. Please don't take it as anything personal.

rhatdan commented 2 years ago

No problem. I did not take it personally. (I never do.) :^)