containers / crun

A fast and lightweight fully featured OCI runtime and C library for running containers

flake: unable to create pod cgroup: slice was already loaded or has a fragment file #1560

edsantiago opened this issue 1 month ago (status: open)

edsantiago commented 1 month ago

So far it's a one-off, only on my laptop:

not ok 180 |200| podman pod create - hashtag AllTheOptions in 1798ms
# tags: ci:parallel 
# (from function `bail-now' in file test/system/helpers.bash, line 192,
#  from function `die' in file test/system/helpers.bash, line 958, 
#  from function `run_podman' in file test/system/helpers.bash, line 560,
#  in test file test/system/200-pod.bats, line 255)
#   `run_podman pod create --name=$podname                \' failed 
#
...
# Successfully tagged localhost/infra_image_t180-z1vymisr:latest
# 86bbdac7ba77e0f3c5ec60dbd4bd2483f06413b792dcb6f2e7de89df5a0f40c1
#
# [08:35:22.656994340] $ bin/podman pod create --name=pod-t180-z1vymisr --pod-id-file=/dev/shm/podman_bats.j48uzw/pod-id-file --hostname=eaijuxkkq6.hostt180-z1vymisr.net --add-host kfam2ieko4.qqloyvhwed.xyz:172.20.201.132 --dns 172.20.62.167 --dns-search nnh0i0kyxmesh0y.abc --dns-option ndots:219 --publish 6035:5080 --label label-frw9BFgOv6q=labelvalue-XsxLIK0sqSZymXGJ933nsX --infra-image infra_image_t180-z1vymisr --infra-command /pause_t180-z1vymisr --infra-name infra_container_t180-z1vymisr
# [08:35:23.082958461] Error: unable to create pod cgroup for pod 18e6e06291e0a3bcf40769a6359fa5c45bcf6978cd67ea20e02c7fa9f31c28f1: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_18e6e06291e0a3bcf40769a6359fa5c45bcf6978cd67ea20e02c7fa9f31c28f1.slice: Unit user-libpod_pod_18e6e06291e0a3bcf40769a6359fa5c45bcf6978cd67ea20e02c7fa9f31c28f1.slice was already loaded or has a fragment file.
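
The refusal comes from systemd rather than from cgroupfs: podman creates the pod slice as a unit under the user manager, and systemd is claiming a unit by that name is already loaded. When this hits, a quick way to see what systemd thinks of the slice (pod ID copied from the log above) is:

```console
$ systemctl --user list-units --all 'user-libpod_pod_*.slice'
$ systemctl --user show -p LoadState,ActiveState,FragmentPath \
    user-libpod_pod_18e6e06291e0a3bcf40769a6359fa5c45bcf6978cd67ea20e02c7fa9f31c28f1.slice
```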

I don't have privs on this repo for adding the flakes tag.

I'm assuming this is a 1.17 issue, because I've never seen it before in gazillions of hours of testing.

edsantiago commented 1 month ago

Seen again, different test:

not ok 192 |200| podman pod cleans cgroup and keeps limits in 924ms
# tags: ci:parallel
# (from function `bail-now' in file test/system/helpers.bash, line 192,
#  from function `die' in file test/system/helpers.bash, line 958,
#  from function `run_podman' in file test/system/helpers.bash, line 560,
#  in test file test/system/200-pod.bats, line 768)
#   `run_podman pod create --infra=$infra --memory=256M' failed
#
# [09:08:20.190255676] $ bin/podman pod create --infra=true --memory=256M
# [09:08:20.601836522] Error: unable to create pod cgroup for pod 35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc.slice: Unit user-libpod_pod_35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc.slice was already loaded or has a fragment file.

edsantiago commented 1 month ago

not ok 191 |200| podman pod ps --filter in 1494ms
# tags: ci:parallel
# (from function `bail-now' in file test/system/helpers.bash, line 192,
#  from function `die' in file test/system/helpers.bash, line 958,
#  from function `run_podman' in file test/system/helpers.bash, line 560,
#  from function `thingy_with_unique_id' in file test/system/200-pod.bats, line 654,
#  in test file test/system/200-pod.bats, line 686)
#   `thingy_with_unique_id "pod" "--infra=false --name $podname" \' failed
#
# [11:05:02.208657911] $ bin/podman pod create --infra=false --name p-1-t191-say7jqkn
# [11:05:02.370536337] afb9854847372de9ceee76bedfea700a555c9c3de375eaa214fb400daab52c81
#
# [11:05:02.381064307] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c1 quay.io/libpod/testimage:20240123 true
# [11:05:02.449486889] ca6bea2d0e611db24b9b303d0a504e440199420eb0df451c270d27750cfcb4f9
#
# [11:05:02.459926846] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c2 quay.io/libpod/testimage:20240123 true
# [11:05:02.538244015] 7516d2be79a05427c1f0f79e8ef86c50317c8398a6b743795ea37f7f23f66029
#
# [11:05:02.553599237] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c3 quay.io/libpod/testimage:20240123 true
# [11:05:02.693390469] 6583b505dc0a26986747ca9246e54f3eace1312c869cf8724d9ff717d95b7cc1
#
# [11:05:02.714082921] $ bin/podman pod create --infra=false --name p-2-t191-say7jqkn
# [11:05:03.071879542] Error: unable to create pod cgroup for pod 16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c.slice: Unit user-libpod_pod_16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c.slice was already loaded or has a fragment file.

kwilczynski commented 1 month ago

@edsantiago, there aren't a lot of changes in 1.17 per:

Would you be able to bisect to see which commit causes this? Unless, perhaps, someone has an idea what it might be...

I also assume that when you use 1.16.1, for example, there aren't any issues?

edsantiago commented 1 month ago

There was never a 1.16 for Fedora (any Fedora), so I jumped from 1.15.1 to 1.17.1.

I can try to bisect tomorrow, but can't promise anything. The bug is hard to reproduce, manifesting only about once in every three to four 15-minute test runs. If you (or anyone) feel like trying on your end (there's a retry-loop sketch after the steps):

  1. cherry-pick https://github.com/containers/podman/pull/23275
  2. hack/bats -T --tag='ci:parallel' &> some-log-file-you-can-examine-later
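
To keep that running unattended, a wrapper loop along these lines works (a sketch; the iteration count and log names are arbitrary):

```bash
# Sketch: rerun the parallel suite until the fragment-file flake shows up.
for i in $(seq 1 10); do
    hack/bats -T --tag='ci:parallel' &> "bats-run-$i.log" || true
    if grep -q 'already loaded or has a fragment file' "bats-run-$i.log"; then
        echo "reproduced on run $i; see bats-run-$i.log"
        break
    fi
done
```
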
edsantiago commented 1 month ago

It is possible that the culprit is commit c6ecb3b5bb1e492fb8e38f90eab106ed9896cf18.

My process: for each bisect step, I ran at least ten iterations of my "reproducer". If the fragment-file failure did not appear in those runs, I marked the commit as good. The failure typically manifests within 2-4 runs, but it's still possible that I marked a bad commit as good. Anyhow, I hope this narrows it down.
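
Roughly, each step could be driven with git bisect run and a helper along these lines (a sketch of the process above, not the exact script I used; it assumes the rebuilt crun is what the podman reproducer picks up):

```bash
#!/bin/bash
# Hypothetical helper for `git bisect run ./check-flake.sh` in the crun tree.
# Exit 1 = bad (flake reproduced), exit 0 = good (ten clean runs),
# exit 125 = tell bisect to skip this commit (e.g. build failure).
make -s || exit 125
for i in $(seq 1 10); do
    hack/bats -T --tag='ci:parallel' &> "bisect-run-$i.log" || true
    grep -q 'already loaded or has a fragment file' "bisect-run-$i.log" && exit 1
done
exit 0
```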

edsantiago commented 1 month ago

ARGH. I dnf-downgraded back to 1.15-1, and now I'm seeing it there too:

# [08:49:43.881619178] $ bin/podman pod start 76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4
# [08:49:44.400350904] Error: starting container 77f23dcb48b247a39687b133b3b51d0972b2b3481066689518a9c3c1f135484f: unable to create pod cgroup for pod 76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4.slice: Unit user-libpod_pod_76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4.slice was already loaded or has a fragment file.
# Error: starting container 34dc713c0e058d41027e9b4ecb10186b01cbd723dc67e6033b6ce803bb31a75f: a dependency of container 34dc713c0e058d41027e9b4ecb10186b01cbd723dc67e6033b6ce803bb31a75f failed to start: container state improper
edsantiago commented 1 month ago

This is NOT new in 1.17. It reproduces fairly reliably with 1.15-1.fc40, and the trigger seems to be enabling the 250-systemd tests in parallel. In particular, the "generate systemd - envar" test:

# bats test_tags=ci:parallel    <<<<<<<<< THIS IS WHAT MAKES THINGS FAIL
@test "podman generate systemd - envar" {

This test itself never fails, or at least hasn't yet; what it does is cause other tests to fail. I can find nothing in this short test that would not play well with others, unless it's all the systemctl daemon-reload calls?
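
For context, the heart of that test is generating a unit file and reloading the user manager; the pattern is roughly this (illustrative shell, not the actual test body):

```console
$ podman create --name c1 --env FOO=bar quay.io/libpod/testimage:20240123 top
$ podman generate systemd --new c1 > ~/.config/systemd/user/container-c1.service
$ systemctl --user daemon-reload
```

If the daemon-reload theory is right, a reload landing while a parallel test is creating its pod slice could plausibly leave systemd confused about whether that slice unit is already loaded.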

Done for today. I don't know if this is a crun bug, or podman, or systemd.