edsantiago opened 1 month ago
Seen again, different test:
not ok 192 |200| podman pod cleans cgroup and keeps limits in 924ms
# tags: ci:parallel
# (from function `bail-now' in file test/system/helpers.bash, line 192,
# from function `die' in file test/system/helpers.bash, line 958,
# from function `run_podman' in file test/system/helpers.bash, line 560,
# in test file test/system/200-pod.bats, line 768)
# `run_podman pod create --infra=$infra --memory=256M' failed
#
# [09:08:20.190255676] $ bin/podman pod create --infra=true --memory=256M
# [09:08:20.601836522] Error: unable to create pod cgroup for pod 35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc.slice: Unit user-libpod_pod_35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc.slice was already loaded or has a fragment file.
not ok 191 |200| podman pod ps --filter in 1494ms
# tags: ci:parallel
# (from function `bail-now' in file test/system/helpers.bash, line 192,
# from function `die' in file test/system/helpers.bash, line 958,
# from function `run_podman' in file test/system/helpers.bash, line 560,
# from function `thingy_with_unique_id' in file test/system/200-pod.bats, line 654,
# in test file test/system/200-pod.bats, line 686)
# `thingy_with_unique_id "pod" "--infra=false --name $podname" \' failed
#
# [11:05:02.208657911] $ bin/podman pod create --infra=false --name p-1-t191-say7jqkn
# [11:05:02.370536337] afb9854847372de9ceee76bedfea700a555c9c3de375eaa214fb400daab52c81
#
# [11:05:02.381064307] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c1 quay.io/libpod/testimage:20240123 true
# [11:05:02.449486889] ca6bea2d0e611db24b9b303d0a504e440199420eb0df451c270d27750cfcb4f9
#
# [11:05:02.459926846] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c2 quay.io/libpod/testimage:20240123 true
# [11:05:02.538244015] 7516d2be79a05427c1f0f79e8ef86c50317c8398a6b743795ea37f7f23f66029
#
# [11:05:02.553599237] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c3 quay.io/libpod/testimage:20240123 true
# [11:05:02.693390469] 6583b505dc0a26986747ca9246e54f3eace1312c869cf8724d9ff717d95b7cc1
#
# [11:05:02.714082921] $ bin/podman pod create --infra=false --name p-2-t191-say7jqkn
# [11:05:03.071879542] Error: unable to create pod cgroup for pod 16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c.slice: Unit user-libpod_pod_16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c.slice was already loaded or has a fragment file.
@edsantiago, there aren't a lot of changes in 1.17 per:
Would you be able to bisect to see which commit causes this? Unless, perhaps, someone has an idea what it might be...
I also assume that when you use 1.16.1, for example, there aren't any issues?
There was never a 1.16 for Fedora (any Fedora), so I jumped from 1.15.1 to 1.17.1.
I can try to bisect tomorrow, but can't promise anything. The bug is hard to reproduce, manifesting only about once in every three to four 15-minute test runs. If you (or anyone) feel like trying on your end:
hack/bats -T --tag='ci:parallel' &> some-log-file-you-can-examine-later
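A minimal repeat-until-flake loop I'd use for that; this is just a sketch, assuming hack/bats from a podman source checkout and grepping for the error string seen in the logs above:

for i in $(seq 1 10); do
    hack/bats -T --tag='ci:parallel' &> "bats-run-$i.log"
    # stop as soon as the cgroup "fragment file" error shows up
    if grep -q 'was already loaded or has a fragment file' "bats-run-$i.log"; then
        echo "reproduced on iteration $i"
        break
    fi
done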
It is possible that the culprit is c6ecb3b5bb1e492fb8e38f90eab106ed9896cf18.
My process: for each bisect step, I ran at least 10 iterations of my "reproducer". If I did not see the fragment-slice failure in ten or more runs, I marked the commit as good. The failure typically manifests within 2-4 runs, but it's still possible that I marked a bad commit as good. Anyhow, I hope this narrows it down.
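For reference, the shape of that bisect as a sketch (the good/bad refs below are placeholders; only the suspect commit above comes from this thread):

git bisect start
git bisect bad HEAD              # a build that showed the fragment-file failure
git bisect good <last-good-ref>  # placeholder: last build without the failure
# at each step: rebuild, run the reproducer up to ~10 times,
# mark bad on any fragment-file failure, good after 10 clean runs
git bisect bad    # or: git bisect good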
ARGH. I dnf downgraded back to 1.15-1, and now I'm seeing it there too:
# [08:49:43.881619178] $ bin/podman pod start 76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4
# [08:49:44.400350904] Error: starting container 77f23dcb48b247a39687b133b3b51d0972b2b3481066689518a9c3c1f135484f: unable to create pod cgroup for pod 76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4.slice: Unit user-libpod_pod_76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4.slice was already loaded or has a fragment file.
# Error: starting container 34dc713c0e058d41027e9b4ecb10186b01cbd723dc67e6033b6ce803bb31a75f: a dependency of container 34dc713c0e058d41027e9b4ecb10186b01cbd723dc67e6033b6ce803bb31a75f failed to start: container state improper
This is NOT new in 1.17. It reproduces fairly reliably with 1.15-1.fc40, and the factor seems to be enabling the 250-systemd tests in parallel. In particular, the "generate systemd - envar" test:
# bats test_tags=ci:parallel <<<<<<<<< THIS IS WHAT MAKES THINGS FAIL
@test "podman generate systemd - envar" {
This test itself never fails, or at least it hasn't yet. What it does is cause other tests to fail. I can find nothing in this short test that would not play well with others, unless it's all the systemctl daemon-reloads?
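If anyone wants to poke at the systemd side, here is a diagnostic sketch (the slice pattern is taken from the failures above; that a stale fragment would survive under /run/user/$UID/systemd/transient/ is my assumption, not something I've confirmed):

# list any pod slices systemd still knows about
systemctl --user list-units --all 'user-libpod_pod_*.slice'
# look for leftover transient fragment files (assumed location for user transient units)
ls /run/user/$UID/systemd/transient/ | grep libpod_pod || true
# a parallel test doing this could race with slice creation/removal
systemctl --user daemon-reload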
Done for today. I don't know if this is a crun bug, or podman, or systemd.
So far it's a one-off, only on my laptop:
I don't have privs on this repo for adding the flakes tag.
I'm assuming this is a 1.17 issue, because I've never seen it before in gazillions of hours of testing.