Open Luap99 opened 1 week ago
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Luap99
The full list of commands accepted by this bot can be found here.
The pull request process is described here
I don't see this being possible. For instance, any test in `120-load.bats` is going to conflict with anything in `build` or `systemd`.
That is what `--no-parallelize-across-files` is for: different files are never executed in parallel. Then we just set the `BATS_NO_PARALLELIZE_WITHIN_FILE=true` option in any file that cannot run in parallel, which is what I do right now. This is really just following your suggestion in https://github.com/containers/podman/pull/22698#issuecomment-2115129093 and seeing where we go from here.
So now I will just go over all the files and see where I can make things work in parallel; the potential savings are huge. If it turns out not to be worth it, or too complex, we can abandon this effort and look into other solutions.
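As a rough sketch of the wiring described above (the job count and path here are illustrative, not the exact CI invocation): the runner would use `bats -j <N> --no-parallelize-across-files test/system/`, and any file that must stay serial opts out in its `setup_file()` hook. The snippet below just prints that opt-out stanza:

```shell
# The hypothetical parallel invocation (illustrative, not the real CI command):
#   bats -j 8 --no-parallelize-across-files test/system/

# A file whose tests cannot run in parallel opts out via setup_file();
# build the stanza in a variable and print it so the sketch is runnable:
snippet='setup_file() {
    # every test in this file runs serially
    export BATS_NO_PARALLELIZE_WITHIN_FILE=true
}'
echo "$snippet"
```

Per the bats-core docs, `BATS_NO_PARALLELIZE_WITHIN_FILE` takes effect when set in `setup_file()`, which is why the stanza lives there rather than at the top of the file.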
| type | distro | user | DB | local | remote | container |
|---|---|---|---|---|---|---|
| sys | rawhide | root | | 29:26 | 20:56 | |
| sys | rawhide | rootless | | 32:39 | | |
| sys | fedora-40-aarch64 | root | | 25:45 | 17:31 | |
| sys | fedora-40 | root | | 30:20 | 19:31 | |
| sys | fedora-40 | rootless | | 28:34 | 20:55 | |
| sys | fedora-39 | root | boltdb | 31:30 | 21:15 | |
| sys | fedora-39 | rootless | boltdb | 33:24 | | |
| sys | debian-13 | root | | 30:50 | 22:23 | |
| sys | debian-13 | rootless | | 34:11 | | |
Just as a baseline of where we are right now. I will push some more changes to see if there is a noticeable speed-up compared to that.
| type | distro | user | DB | local | remote | container |
|---|---|---|---|---|---|---|
| sys | rawhide | root | | 29:19 | 18:37 | |
| sys | rawhide | rootless | | 31:04 | | |
| sys | fedora-40-aarch64 | root | | 22:39 | 16:40 | |
| sys | fedora-40 | root | | 28:58 | 20:04 | |
| sys | fedora-40 | rootless | | 28:59 | 19:04 | |
| sys | fedora-39 | root | boltdb | 31:06 | 22:30 | |
| sys | fedora-39 | rootless | boltdb | 29:56 | | |
| sys | debian-13 | root | | 29:07 | 21:13 | |
| sys | debian-13 | rootless | | 34:25 | | |
Looks better on average, I would say, although not by much. The run-to-run variance is still very high; I think I need to cut into the slower tests more to make a noticeable impact, and maybe bump the CPU core count so we can actually take advantage of the parallel runs.
> That is what `--no-parallelize-across-files` is for
Grrr, I deserve that for trying to do a quick sloppy review on my way out the door. I will be more careful today.
| type | distro | user | DB | local | remote | container |
|---|---|---|---|---|---|---|
| sys | rawhide | root | | 28:20 | 17:40 | |
| sys | rawhide | rootless | | 27:47 | | |
| sys | fedora-40-aarch64 | root | | 24:37 | 15:53 | |
| sys | fedora-40 | root | | 27:54 | 17:53 | |
| sys | fedora-40 | rootless | | 29:02 | 16:28 | |
| sys | fedora-39 | root | boltdb | 30:52 | !19:47 | |
| sys | fedora-39 | rootless | boltdb | 28:38 | | |
| sys | debian-13 | root | | 28:31 | 20:24 | |
| sys | debian-13 | rootless | | 29:59 | | |
Current timings: the time is going down, but not as much as I would have hoped. Many slow tests cannot be run in parallel without major changes, and our VMs currently only have two cores. I will try using 4-core VMs like the int tests next.
Also, one selinux test seems to have flaked, which I need to figure out.
So the one thing I am trying to understand: when running locally (12 threads) I see a huge delta from run to run, e.g. running the selinux file takes around ~13s most of the time, but there are outliers where it is 45-50s.
I'd like to understand this before I continue pushing more changes into this PR.
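To put a number on that variance, a small loop like the one below can repeat the same file and report per-run wall-clock time (the real `bats` invocation is left commented out and replaced with a no-op placeholder so the sketch runs anywhere):

```shell
# Sketch: time the same bats file several times to quantify
# run-to-run variance. Collect results in a variable as we go.
runs=3
results=""
for i in $(seq "$runs"); do
    start=$(date +%s)
    # bats -T test/system/410-selinux.bats   # the real command under test
    :                                        # no-op placeholder
    end=$(date +%s)
    results="$results run $i: $((end - start))s;"
    echo "run $i: $((end - start))s"
done
```

With the real command in place, a spread of 13s vs. 45-50s across iterations would confirm the outliers are reproducible rather than a one-off.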
Are you running with `-T`? Can you post sample results?
Yes, always with `-T` (I wonder if we should have this by default in hack/bats?)
```
$ hack/bats -T --rootless --remote 410
--------------------------------------------------
$ bats test/system/410-selinux.bats
410-selinux.bats
✓ [410] podman selinux: confined container [1680]
✓ [410] podman selinux: container with label=disable [2259]
✓ [410] podman selinux: privileged container [2432]
✓ [410] podman selinux: privileged --userns=host container [2120]
✓ [410] podman selinux: --ipc=host container [2808]
✓ [410] podman selinux: init container [2241]
✓ [410] podman selinux: init container with --security-opt type [2794]
✓ [410] podman selinux: init container with --security-opt level&type [3211]
✓ [410] podman selinux: init container with --security-opt level [2869]
✓ [410] podman selinux: pid=host [3577]
✓ [410] podman selinux: container with overridden range [2342]
- [410] podman selinux: inspect kvm labels (skipped: runtime flag is not passed over remote) [1076]
✓ [410] podman selinux: inspect multiple labels [2645]
✓ [410] podman selinux: shared context in (some) namespaces [6962]
✓ [410] podman selinux: containers in pods share full context [4999]
✓ [410] podman selinux: containers in --no-infra pods do not share context [3368]
✓ [410] podman with nonexistent labels [2159]
✓ [410] podman selinux: check relabel [5928]
✓ [410] podman selinux nested [3569]
✓ [410] podman EnableLabeledUsers [3553]
- [410] podman selinux: check unsupported relabel (skipped: not applicable under rootless podman) [959]
21 tests, 0 failures, 2 skipped in 12 seconds

$ hack/bats -T --rootless --remote 410
--------------------------------------------------
$ bats test/system/410-selinux.bats
410-selinux.bats
✓ [410] podman selinux: confined container [1791]
✓ [410] podman selinux: container with label=disable [1867]
✓ [410] podman selinux: privileged container [40455]
✓ [410] podman selinux: privileged --userns=host container [2177]
✓ [410] podman selinux: --ipc=host container [2532]
✓ [410] podman selinux: init container [2242]
✓ [410] podman selinux: init container with --security-opt type [1877]
✓ [410] podman selinux: init container with --security-opt level&type [2923]
✓ [410] podman selinux: init container with --security-opt level [2824]
✓ [410] podman selinux: pid=host [3192]
✓ [410] podman selinux: container with overridden range [2505]
- [410] podman selinux: inspect kvm labels (skipped: runtime flag is not passed over remote) [1205]
✓ [410] podman selinux: inspect multiple labels [3443]
✓ [410] podman selinux: shared context in (some) namespaces [6226]
✓ [410] podman selinux: containers in pods share full context [5480]
✓ [410] podman selinux: containers in --no-infra pods do not share context [4180]
✓ [410] podman with nonexistent labels [2511]
✓ [410] podman selinux: check relabel [6003]
✓ [410] podman selinux nested [3538]
✓ [410] podman EnableLabeledUsers [3821]
- [410] podman selinux: check unsupported relabel (skipped: not applicable under rootless podman) [723]
21 tests, 0 failures, 2 skipped in 44 seconds
```
Note I am using remote here as I have been using it all day to reproduce the selinux remote flake I saw, but I am pretty sure I have seen this with local and other files as well.
> ✓ [410] podman selinux: privileged container [40455]

It is always a different test that is slow, so I'm not sure where the pattern is.
I enabled `-T` in the Makefile, but I didn't include hack/bats because I couldn't (quickly) think of a way to add a disable-`-T` option. That no longer seems as important now. If you feel like enabling `-T` in hack/bats here, I'm fine with that.

> It is always a different test that is slow, so I'm not sure where the pattern is

Weird. My first assumption was a lock, but that seems unlikely if only one test is hiccuping. I've skimmed the source file and see nothing obvious.
Cherry-picked commits from https://github.com/containers/podman/pull/22831. Given that the tests run in parallel here, maybe IO is a bigger bottleneck, so I'd like to try it out.
Well, the good news is we see good speed improvements:
| type | distro | user | DB | local | remote | container |
|---|---|---|---|---|---|---|
| sys | rawhide | root | | 22:48 | !15:18 | |
| sys | rawhide | rootless | | 24:39 | | |
| sys | fedora-40-aarch64 | root | | !20:52 | 13:23 | |
| sys | fedora-40 | root | | !25:23 | 14:54 | |
| sys | fedora-40 | rootless | | 23:35 | 13:46 | |
| sys | fedora-39 | root | boltdb | 25:11 | 16:00 | |
| sys | fedora-39 | rootless | boltdb | 25:11 | | |
| sys | debian-13 | root | | 26:43 | 20:04 | |
| sys | debian-13 | rootless | | 27:02 | | |
The sad news is weird flakes...
```
<+-737873ns> # # podman run --cgroups=disabled --cgroupns=host --rm quay.io/libpod/testimage:20240123 cat /proc/self/cgroup
<+192ms> # time="2024-06-21T08:22:55-05:00" level=error msg="invalid internal status, try resetting the pause process with \"/var/tmp/go/src/github.com/containers/podman/bin/podman system migrate\": cannot read \"/run/containers/storage/overlay-containers/ff81f51b209a791e480cd6668e4e9494cb98d751356359a427a23efd80ba0d22/userdata/conmon.pid\": EOF"
<+008ms> # [ rc=1 (** EXPECTED 0 **) ]
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 1; expected 0
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
<+202ms> # # podman system df
<+060ms> # time="2024-06-21T08:22:07-05:00" level=error msg="failed to get root file system size of container 5c2f1c54772b6e0a11447fa19722d2de9a9563a0f6f1fa6464f2395d95d2d4e9: container not known"
         # time="2024-06-21T08:22:07-05:00" level=error msg="failed to get read/write size of container 5c2f1c54772b6e0a11447fa19722d2de9a9563a0f6f1fa6464f2395d95d2d4e9: container not known"
         # TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
         # Images         1           0           11.76MB     11.76MB (100%)
         # Containers     1           0           0B          0B (0%)
         # Local Volumes  0           0           0B          0B (0%)
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
^^^ This one is not even part of a parallel test file, so this is extremely weird, and it failed in two different runs, so it is likely very common.
There is also the selinux remote failure I looked at yesterday. I found the root cause for that, but it might take a bit before I can fix it.
```
<+     > # # podman rmi -f quay.io/libpod/systemd-image:20240124
<+115ms> # Untagged: quay.io/libpod/systemd-image:20240124
         # Deleted: 77def021e71fdab2c91a22b49cba046d8b1e62842583d2509305c70bec48e36f
         #
<+159ms> # # podman ps -a --external
<+043ms> # CONTAINER ID  IMAGE                         COMMAND     CREATED             STATUS      PORTS       NAMES
         # 65fb253e2a2f  quay.io/libpod/alpine:latest  top -d 120  About a minute ago  Removing                c__B1jcHj6XA0
         #
<+009ms> # # podman system df
<+043ms> # time="2024-07-01T10:41:39Z" level=error msg="failed to get root file system size of container 65fb253e2a2fc055c2eaec94241c9a60f02a2381ee1b16783ac8768c0405a47f: container not known"
         # time="2024-07-01T10:41:39Z" level=error msg="failed to get read/write size of container 65fb253e2a2fc055c2eaec94241c9a60f02a2381ee1b16783ac8768c0405a47f: container not known"
         # TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
         # Images         1           0           13.14MB     13.14MB (100%)
         # Containers     1           0           0B          0B (0%)
         # Local Volumes  0           0           0B          0B (0%)
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         # # [teardown]
```
Ok, I finally captured the issue: something in auto-update seems to leak a container (stuck in the removing state for some reason). Auto-update runs containers in systemd units, so I guess systemd is killing the podman process at the wrong moment for whatever reason. That will be fun to debug :(
This is part 1 of major work to parallelize as many system tests as we can, to speed up the overall test time.
The problem is we can no longer perform any leak check or cleanup after each run, which could be an issue. However, as of commit https://github.com/containers/podman/commit/81c90f51c242ecb082c71cb53e80c5cbb89b2a6f we no longer do the leak check in CI anyway.
There will be tests that cannot be parallelized because they touch global state, e.g. image removal.
This commit uses the bats `-j` option to run tests in parallel, but also sets `--no-parallelize-across-files` to make sure we never run in parallel across files. This allows us to disable parallel execution on a per-file basis via the `BATS_NO_PARALLELIZE_WITHIN_FILE=true` option.
Right now only 001 and 005 are set up to run in parallel, and this alone gives me a 30s improvement in total run time locally when running both files.
Does this PR introduce a user-facing change?