Closed oseiberts11 closed 1 year ago
This looks outside of our control. In any case, please test this with the latest versions (kernel and podman).
I tried to reproduce it in a VM (strategy: first reproduce with the same versions, then later update things to see that it works again), but this strategy failed. Unfortunately the only Linux VM images I have available are based on an ext4 file system, and with that the issue did not reproduce. I tried adding an extra virtual disk, formatted with xfs, and using that as much as possible (mounted as the home directory of the user), but it still did not reproduce. Does anyone know the best place to report xfs and/or unionfs trouble? Maybe the stack trace I reported is enough to tell somebody what/where the problem is.
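When attempting a reproduction like this, it may help to first confirm that the container storage really sits on xfs and which graph driver is in use. The following is only a rough sketch, assuming GNU coreutils `stat` and a working `podman info`; the helper name `fs_type_of` is hypothetical, and the script falls back to `$HOME` when podman is not available.

```shell
#!/bin/sh
# Print the filesystem type backing a given path (GNU coreutils stat).
fs_type_of() {
    stat -f -c %T "$1"
}

graph_root=""
# Only query podman when it is installed.
if command -v podman >/dev/null 2>&1; then
    graph_root=$(podman info --format '{{.Store.GraphRoot}}' 2>/dev/null)
    echo "graph driver: $(podman info --format '{{.Store.GraphDriverName}}' 2>/dev/null)"
fi
# Fall back to the home directory if podman gave no answer.
graph_root=${graph_root:-$HOME}

echo "storage path: $graph_root"
echo "filesystem:   $(fs_type_of "$graph_root")"
```

If the last line does not report `xfs`, the test setup probably does not match the affected machine closely enough.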
I would report this as a bug on the Linux kernel Bugzilla.
That depends on whether the xfs maintainers even use the kernel Bugzilla. You should refer to https://docs.kernel.org/admin-guide/reporting-issues.html. Either way, I think it is very likely that you will have to test the latest kernel.
Issue Description
In https://github.com/containers/podman/issues/16062#issuecomment-1550239180 I was asked to report a new issue here; mainly to eliminate it as a bug in podman itself.
I will try to squeeze the information I have into this template.
https://github.com/containers/podman/issues/16062#issuecomment-1334397333 considers a locking bug inside podman. We seem to have a deadlock in the xfs file system, triggered by invoking podman. I am wondering whether both have the same cause (but it smells like a kernel bug).
The issue occurs when starting a container for RabbitMQ with --user rabbitmq --userns=keep-id, on newer kernels (we noticed when updating Ubuntu from Bionic 18.04 to Focal 20.04) when the native overlayfs is used and not fuse-overlayfs.
One thing that is sub-optimal is that RabbitMQ needs its home directory mounted in the container (/var/lib/rabbitmq) but this is also where Podman stores all the container files; so effectively the container files are mounted into the container.
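For orientation, the invocation described above can be sketched roughly as follows. This is an assumed reconstruction, not the command from the actual unit file: the image name `registry.example.com/rabbitmq:latest` and the helper `rabbitmq_cmd` are hypothetical, and the function only prints the command so that it can be inspected without risking a hang.

```shell
#!/bin/sh
# Hypothetical image reference; the real image comes from a private registry.
IMAGE=${IMAGE:-registry.example.com/rabbitmq:latest}

# Print (do not run) a podman command using the options named in the report:
# --user rabbitmq, --userns=keep-id, and the home directory bound into the container.
rabbitmq_cmd() {
    printf '%s\n' "podman run --user rabbitmq --userns=keep-id -v /var/lib/rabbitmq:/var/lib/rabbitmq $IMAGE"
}

rabbitmq_cmd
```

Note that the bind-mounted directory is also where rootless Podman keeps its storage for this user, which is the overlap described above.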
Our current workaround is adding --storage-opt "overlay.mount_program=/usr/bin/fuse-overlayfs".
The user rabbit in question has /var/lib/rabbitmq as its home directory, and it is also bound into the container. Its UID differs from the UID of the user rabbitmq, which may have been installed by a rabbitmq package. This may or may not be a factor in the bug.
The version of Podman we have is fairly old (3.4.2), but Ubuntu doesn't seem to have packaged a newer version for Ubuntu 20.04. Therefore we have not yet been able to check whether updating podman helps.
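The fuse-overlayfs workaround mentioned above could be wired in along these lines. This is a sketch under assumptions: `--storage-opt` is passed as a global podman option (before the subcommand), the helper `storage_opt_for` is a hypothetical name, and the script only prints what it would run.

```shell
#!/bin/sh
# Path taken from the workaround in the report; override if it differs.
FUSE_OVERLAYFS=${FUSE_OVERLAYFS:-/usr/bin/fuse-overlayfs}

# Emit the podman storage option for the given mount program path.
storage_opt_for() {
    printf '%s\n' "--storage-opt overlay.mount_program=$1"
}

if [ -x "$FUSE_OVERLAYFS" ]; then
    # Global options go before the subcommand.
    echo "would run: podman $(storage_opt_for "$FUSE_OVERLAYFS") run ..."
else
    echo "fuse-overlayfs not found at $FUSE_OVERLAYFS" >&2
fi
```

Checking for the binary first avoids silently depending on a mount program that is not installed.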
Steps to reproduce the issue
systemctl start rabbitmq
with the following systemd unit file:

We pull a rabbitmq image from our own container registry, but since the problem occurs before the container even really starts, I think it doesn't matter much what exactly is in the image. It was made with this Dockerfile:
Describe the results you received
The machine became unusable, processes got stuck, logins became impossible.
According to the kernel log, processes got stuck in xfs file system code. Here is the first such process; more followed.
The kernel warns about processes being stuck; here is one of them:
After this, more and more processes get blocked as they try to access the file system. This makes the whole machine unusable.
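To collect the blocked-task warnings for a kernel bug report, something like the following filter may be useful. It is a minimal sketch that assumes the kernel's usual hung-task wording ("blocked for more than N seconds"); the function name `hung_tasks` is hypothetical.

```shell
#!/bin/sh
# Filter kernel log text on stdin, keeping only hung-task warnings.
hung_tasks() {
    grep -E 'task .+ blocked for more than [0-9]+ seconds'
}

# Example use against the running kernel's log (may require root):
#   dmesg | hung_tasks
#   journalctl -k | hung_tasks
```

The matching lines name the blocked process and PID, and in the full log each is followed by a stack trace, which is the part the file system maintainers would most likely want to see.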
Describe the results you expected
I expected the program in the container image to start, but it never got to that point.
podman info output
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
No
Additional environment details
This is on real iron.
Additional information
No response