containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

test/system: run tests in parallel where possible #23048

Open Luap99 opened 1 week ago

Luap99 commented 1 week ago

This is part 1 of major work to parallelize as many of the system tests as we can, in order to speed up the overall test time.

The problem is that we can no longer perform any leak check or removal after each test run, which could be an issue. However, as of commit https://github.com/containers/podman/commit/81c90f51c242ecb082c71cb53e80c5cbb89b2a6f we no longer do the leak check in CI anyway.

There will be tests that cannot be parallelized because they touch global state, e.g. image removal.

This commit uses the bats -j option to run tests in parallel, but also sets --no-parallelize-across-files to make sure we never run tests from different files in parallel. That allows us to disable parallelism on a per-file basis via the BATS_NO_PARALLELIZE_WITHIN_FILE=true option.

Right now only 001 and 005 are set up to run in parallel, and this alone gives me a 30s improvement in total time locally when running both files.
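For illustration, a minimal sketch of the invocation this enables (the job count and target path are examples, not necessarily what CI uses):

```bash
# Run system tests in parallel (-j/--jobs) while keeping whole files
# serialized relative to each other; bats parallel mode needs GNU parallel.
bats --jobs "$(nproc)" --no-parallelize-across-files test/system/
```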

Does this PR introduce a user-facing change?

None
openshift-ci[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/containers/podman/blob/main/OWNERS)~~ [Luap99]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
edsantiago commented 1 week ago

I don't see this being possible. For instance, any test in 120-load.bats is going to conflict with anything in build or systemd.

Luap99 commented 1 week ago

I don't see this being possible. For instance, any test in 120-load.bats is going to conflict with anything in build or systemd.

That is what --no-parallelize-across-files is for: different files are never executed in parallel. Then we just set the BATS_NO_PARALLELIZE_WITHIN_FILE=true option in any file that cannot run in parallel, which is what I do right now. This is really just following your suggestion in https://github.com/containers/podman/pull/22698#issuecomment-2115129093 and seeing where we go from here.
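As a sketch of what that per-file opt-out looks like (hypothetical file name; per the bats docs the variable has to be set at file scope or in setup_file, before any test runs):

```bash
#!/usr/bin/env bats
# hypothetical test/system/NNN-global-state.bats

# This file's tests touch global state (e.g. they remove shared images),
# so tell bats not to run them in parallel with each other.
# shellcheck disable=SC2034
BATS_NO_PARALLELIZE_WITHIN_FILE=true

@test "example: must run serially" {
    run podman image exists quay.io/libpod/testimage:20240123
    [ "$status" -eq 0 ]
}
```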

So now I just go over all the files and see where I can make things work in parallel; the potential savings are huge. If it turns out not to be worth it or too complex, we can abandon this effort and look into other solutions.

Luap99 commented 1 week ago
| type | distro | user | DB | local | remote | container |
|------|--------|------|----|-------|--------|-----------|
| sys | rawhide | root | | 29:26 | 20:56 | |
| sys | rawhide | rootless | | 32:39 | | |
| sys | fedora-40-aarch64 | root | | 25:45 | 17:31 | |
| sys | fedora-40 | root | | 30:20 | 19:31 | |
| sys | fedora-40 | rootless | | 28:34 | 20:55 | |
| sys | fedora-39 | root | boltdb | 31:30 | 21:15 | |
| sys | fedora-39 | rootless | boltdb | 33:24 | | |
| sys | debian-13 | root | | 30:50 | 22:23 | |
| sys | debian-13 | rootless | | 34:11 | | |

Just as a baseline for where we are right now. I will push some more changes to see if there is a noticeable speed-up compared to that.

Luap99 commented 1 week ago
| type | distro | user | DB | local | remote | container |
|------|--------|------|----|-------|--------|-----------|
| sys | rawhide | root | | 29:19 | 18:37 | |
| sys | rawhide | rootless | | 31:04 | | |
| sys | fedora-40-aarch64 | root | | 22:39 | 16:40 | |
| sys | fedora-40 | root | | 28:58 | 20:04 | |
| sys | fedora-40 | rootless | | 28:59 | 19:04 | |
| sys | fedora-39 | root | boltdb | 31:06 | 22:30 | |
| sys | fedora-39 | rootless | boltdb | 29:56 | | |
| sys | debian-13 | root | | 29:07 | 21:13 | |
| sys | debian-13 | rootless | | 34:25 | | |

Looks better on average, I would say, although not by much. The run-to-run variance is still very high. I think I need to cut into the slower tests more to make a noticeable impact, and maybe increase the CPU cores so we can actually take advantage of the parallel runs.

edsantiago commented 1 week ago

That is what --no-parallelize-across-files is for

Grrr, I deserve that for trying to do a quick sloppy review on my way out the door. I will be more careful today.

Luap99 commented 1 week ago
| type | distro | user | DB | local | remote | container |
|------|--------|------|----|-------|--------|-----------|
| sys | rawhide | root | | 28:20 | 17:40 | |
| sys | rawhide | rootless | | 27:47 | | |
| sys | fedora-40-aarch64 | root | | 24:37 | 15:53 | |
| sys | fedora-40 | root | | 27:54 | 17:53 | |
| sys | fedora-40 | rootless | | 29:02 | 16:28 | |
| sys | fedora-39 | root | boltdb | 30:52 | !19:47 | |
| sys | fedora-39 | rootless | boltdb | 28:38 | | |
| sys | debian-13 | root | | 28:31 | 20:24 | |
| sys | debian-13 | rootless | | 29:59 | | |

Current timings: the time is going down, but not as much as I would have hoped. Many slow tests cannot be run in parallel without major changes, and our VMs currently only have two cores. I will try using a 4-core VM like the int tests next.

Luap99 commented 1 week ago

Also, one selinux test seems to have flaked, which I need to figure out.

Luap99 commented 1 week ago

So the one thing I am trying to understand is that when running locally (12 threads), I see a huge delta from run to run; e.g. running the selinux file takes around ~13s most of the time, but there are outliers where it takes 45-50s.

I would like to understand this before I continue pushing more changes into this PR.

edsantiago commented 1 week ago

Are you running with -T? Can you post sample results?

Luap99 commented 1 week ago

Yes, always with -T (I wonder if we should have this by default in hack/bats?).

$ hack/bats -T --rootless --remote  410
--------------------------------------------------
$ bats  test/system/410-selinux.bats
410-selinux.bats
 ✓ [410] podman selinux: confined container [1680]
 ✓ [410] podman selinux: container with label=disable [2259]
 ✓ [410] podman selinux: privileged container [2432]
 ✓ [410] podman selinux: privileged --userns=host container [2120]
 ✓ [410] podman selinux: --ipc=host container [2808]
 ✓ [410] podman selinux: init container [2241]
 ✓ [410] podman selinux: init container with --security-opt type [2794]
 ✓ [410] podman selinux: init container with --security-opt level&type [3211]
 ✓ [410] podman selinux: init container with --security-opt level [2869]
 ✓ [410] podman selinux: pid=host [3577]
 ✓ [410] podman selinux: container with overridden range [2342]
 - [410] podman selinux: inspect kvm labels (skipped: runtime flag is not passed over remote) [1076]
 ✓ [410] podman selinux: inspect multiple labels [2645]
 ✓ [410] podman selinux: shared context in (some) namespaces [6962]
 ✓ [410] podman selinux: containers in pods share full context [4999]
 ✓ [410] podman selinux: containers in --no-infra pods do not share context [3368]
 ✓ [410] podman with nonexistent labels [2159]
 ✓ [410] podman selinux: check relabel [5928]
 ✓ [410] podman selinux nested [3569]
 ✓ [410] podman EnableLabeledUsers [3553]
 - [410] podman selinux: check unsupported relabel (skipped: not applicable under rootless podman) [959]

21 tests, 0 failures, 2 skipped in 12 seconds

$ hack/bats -T --rootless --remote  410
--------------------------------------------------
$ bats  test/system/410-selinux.bats
410-selinux.bats
 ✓ [410] podman selinux: confined container [1791]
 ✓ [410] podman selinux: container with label=disable [1867]
 ✓ [410] podman selinux: privileged container [40455]
 ✓ [410] podman selinux: privileged --userns=host container [2177]
 ✓ [410] podman selinux: --ipc=host container [2532]
 ✓ [410] podman selinux: init container [2242]
 ✓ [410] podman selinux: init container with --security-opt type [1877]
 ✓ [410] podman selinux: init container with --security-opt level&type [2923]
 ✓ [410] podman selinux: init container with --security-opt level [2824]
 ✓ [410] podman selinux: pid=host [3192]
 ✓ [410] podman selinux: container with overridden range [2505]
 - [410] podman selinux: inspect kvm labels (skipped: runtime flag is not passed over remote) [1205]
 ✓ [410] podman selinux: inspect multiple labels [3443]
 ✓ [410] podman selinux: shared context in (some) namespaces [6226]
 ✓ [410] podman selinux: containers in pods share full context [5480]
 ✓ [410] podman selinux: containers in --no-infra pods do not share context [4180]
 ✓ [410] podman with nonexistent labels [2511]
 ✓ [410] podman selinux: check relabel [6003]
 ✓ [410] podman selinux nested [3538]
 ✓ [410] podman EnableLabeledUsers [3821]
 - [410] podman selinux: check unsupported relabel (skipped: not applicable under rootless podman) [723]

21 tests, 0 failures, 2 skipped in 44 seconds

Note I am using remote here, as I have been using it all day to reproduce a selinux remote flake I saw, but I am pretty sure I have seen this with local and other files as well.

Luap99 commented 1 week ago

✓ [410] podman selinux: privileged container [40455]

It is always a different test that is slow, so I am not sure what the pattern is.

edsantiago commented 1 week ago

#22886 enabled -T in the Makefile, but I didn't include hack/bats because I couldn't (quickly) think of a way to add a disable-T option. That no longer seems as important now. If you feel like enabling -T in hack/bats here, I'm fine with that.
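One possible sketch of such an opt-out, assuming an invented BATS_NO_TIMING variable (this is not what hack/bats currently does):

```bash
#!/usr/bin/env bash
# hypothetical excerpt from hack/bats: pass --timing (-T) to bats by
# default, but let the caller disable it with BATS_NO_TIMING=true.
bats_args=()
if [[ "${BATS_NO_TIMING:-}" != "true" ]]; then
    bats_args+=("-T")
fi
exec bats "${bats_args[@]}" "$@"
```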

It is always a different tests that is slow so not sure where the pattern is

Weird. My first assumption was a lock, but that seems unlikely if only one test is hiccuping. I've skimmed the source file and see nothing obvious.

Luap99 commented 1 week ago

Cherry-picked commits from https://github.com/containers/podman/pull/22831. Given that tests run in parallel here, IO may be a bigger bottleneck, so I would like to try it out.

Luap99 commented 1 week ago

Well, the good news is we see good speed improvements:

| type | distro | user | DB | local | remote | container |
|------|--------|------|----|-------|--------|-----------|
| sys | rawhide | root | | 22:48 | !15:18 | |
| sys | rawhide | rootless | | 24:39 | | |
| sys | fedora-40-aarch64 | root | | !20:52 | 13:23 | |
| sys | fedora-40 | root | | !25:23 | 14:54 | |
| sys | fedora-40 | rootless | | 23:35 | 13:46 | |
| sys | fedora-39 | root | boltdb | 25:11 | 16:00 | |
| sys | fedora-39 | rootless | boltdb | 25:11 | | |
| sys | debian-13 | root | | 26:43 | 20:04 | |
| sys | debian-13 | rootless | | 27:02 | | |

The sad news is weird flakes...

<+-737873ns> # # podman run --cgroups=disabled --cgroupns=host --rm quay.io/libpod/testimage:20240123 cat /proc/self/cgroup
<+192ms> # time="2024-06-21T08:22:55-05:00" level=error msg="invalid internal status, try resetting the pause process with \"/var/tmp/go/src/github.com/containers/podman/bin/podman system migrate\": cannot read \"/run/containers/storage/overlay-containers/ff81f51b209a791e480cd6668e4e9494cb98d751356359a427a23efd80ba0d22/userdata/conmon.pid\": EOF"
<+008ms> # [ rc=1 (** EXPECTED 0 **) ]
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 1; expected 0
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
<+202ms> # # podman system df
<+060ms> # time="2024-06-21T08:22:07-05:00" level=error msg="failed to get root file system size of container 5c2f1c54772b6e0a11447fa19722d2de9a9563a0f6f1fa6464f2395d95d2d4e9: container not known"
         # time="2024-06-21T08:22:07-05:00" level=error msg="failed to get read/write size of container 5c2f1c54772b6e0a11447fa19722d2de9a9563a0f6f1fa6464f2395d95d2d4e9: container not known"
         # TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
         # Images         1           0           11.76MB     11.76MB (100%)
         # Containers     1           0           0B          0B (0%)
         # Local Volumes  0           0           0B          0B (0%)
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

^^^ This one is not even part of a parallel test file, so this is extremely weird, and it failed in two different runs, so it is likely quite common.

There is also the selinux remote failure thing I looked at yesterday. I found the root cause for that but it might take a bit before I can fix it.

Luap99 commented 1 day ago
<+     > # # podman rmi -f quay.io/libpod/systemd-image:20240124
<+115ms> # Untagged: quay.io/libpod/systemd-image:20240124
         # Deleted: 77def021e71fdab2c91a22b49cba046d8b1e62842583d2509305c70bec48e36f
         #
<+159ms> # # podman ps -a --external
<+043ms> # CONTAINER ID  IMAGE                         COMMAND     CREATED             STATUS      PORTS       NAMES
         # 65fb253e2a2f  quay.io/libpod/alpine:latest  top -d 120  About a minute ago  Removing                c__B1jcHj6XA0
         #
<+009ms> # # podman system df
<+043ms> # time="2024-07-01T10:41:39Z" level=error msg="failed to get root file system size of container 65fb253e2a2fc055c2eaec94241c9a60f02a2381ee1b16783ac8768c0405a47f: container not known"
         # time="2024-07-01T10:41:39Z" level=error msg="failed to get read/write size of container 65fb253e2a2fc055c2eaec94241c9a60f02a2381ee1b16783ac8768c0405a47f: container not known"
         # TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
         # Images         1           0           13.14MB     13.14MB (100%)
         # Containers     1           0           0B          0B (0%)
         # Local Volumes  0           0           0B          0B (0%)
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         # # [teardown]

OK, I finally captured the issue: something in auto-update seems to leak a container (stuck in the Removing state for some reason). Auto-update runs containers in systemd units, so I guess systemd is killing the podman process at the wrong moment for whatever reason. That will be fun to debug :(