containers/podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

CI: misc parallel flakes #23479

Open edsantiago opened 4 months ago

edsantiago commented 4 months ago

Hodgepodge of parallel-system-test flakes that don't seem to fit anywhere else. I think most of these just need something like

    # Expectation (in seconds) of when we should time out. When running
    # in parallel, allow 2 more seconds due to system load
    local expect=4
    if [[ -n "$PARALLEL_JOBSLOT" ]]; then
        expect=$((expect + 2))
    fi
    assert $delta_t -le $expect \
           "podman kube play did not get killed within $expect seconds"
Seen in: sys(6) podman(6) rawhide(2) root(3) host(6) sqlite(4) debian-13(2) rootless(3) boltdb(2) fedora-39(2)
Honny1 commented 3 months ago

Hi @edsantiago, I think increasing the expected time in [035] podman logs - --until --follow journald is not a good idea. Since the time can vary with the actual load on the machine, or because of the scheduler when many parallel jobs are running, the test should check whether the command gets 3s of logs, not how long it takes.
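
A rough sketch of what such a content-based check could look like (hypothetical: $cname and $until are placeholders, the container is assumed to log one line every 0.1s, and run_podman/assert are the usual system-test helpers):

    # Hypothetical: instead of timing the command, count how many log
    # lines were collected. With one line written every 0.1s, ~3s of
    # logs should be roughly 30 lines.
    run_podman logs --until "$until" --follow $cname
    line_count=$(wc -l <<<"$output")
    assert $line_count -ge 25 "expected roughly 3s worth of log lines"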

For the [035] podman logs - multi k8s-file test, I would say that the first container did not finish its job and was put to sleep due to CI machine load. The test should probably wait for both containers to finish their work before reading the logs.
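
A minimal sketch of that ordering (hypothetical container names $cname1 and $cname2, using the run_podman helper):

    # Hypothetical: block until both containers have exited, so a container
    # that got descheduled under CI load can still finish writing its lines
    # before the test reads the logs.
    run_podman wait $cname1 $cname2
    run_podman logs $cname1 $cname2
    # ...then run the existing assertions against $output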

In the test [035] podman logs - --since --follow journald, I would say that when running in parallel, journald is used by multiple containers, so it will be necessary to increase the timeout to give the container more time to write to journald, and also to check that the end of the journald content has been reached.

Luap99 commented 3 months ago

> I think increasing the expected time in [035] podman logs - --until --follow journald is not a good idea. Since the time can vary with the actual load on the machine, or because of the scheduler when many parallel jobs are running, the test should check whether the command gets 3s of logs, not how long it takes.

Keep in mind the same race exists for the ctr process, so there is no way of knowing what "3s of logs" is: depending on scheduling, the ctr process might have written only a few lines rather than 30 at the sleep 0.1 interval, so it is impossible to tell whether the writer side didn't write fast enough or the reader loses messages. As such, "the process should exit after 3s" is simple and easy to check in theory, but of course it also has the timing problem. And we also want to check that the logs process actually exits in time.

I am, however, not sure how the rounding works with the built-in $SECONDS in bash; maybe it would be safer to take the time before and after in ms and compare that?
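
Something along those lines, as a hedged sketch (assumes GNU date with %N support; $cname, $until, and the 4000 ms bound are placeholders, not the real test values):

    # Hypothetical millisecond-based timing instead of $SECONDS, whose
    # whole-second granularity hides sub-second differences.
    t0=$(date +%s%3N)                         # ms since epoch
    run_podman logs --until "$until" --follow $cname
    t1=$(date +%s%3N)
    delta_ms=$((t1 - t0))
    assert $delta_ms -le 4000 \
           "podman logs --until --follow did not exit within 4000 ms"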

Honny1 commented 3 months ago

@Luap99 I tested $SECONDS and time in ms. I found that $SECONDS is not accurate because the time is rounded to whole seconds: if t0 is at 1856 ms, $SECONDS still reports 1. This inaccuracy makes the command appear at most only about 150 ms late, which is less variation than I observed between test runs (the time was around 3150-3650 ms). At higher workloads, this delay can be larger.
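
For reference, a quick illustration of that rounding (hypothetical snippet, assumes bash >= 5.0 for $EPOCHREALTIME):

    # $SECONDS advances only in whole seconds, so a ~1.8s sleep may report
    # 1 or 2 depending on where in the wall-clock second the timer started.
    SECONDS=0
    t0=$EPOCHREALTIME
    sleep 1.8
    echo "SECONDS=$SECONDS"            # whole-second granularity: 1 or 2
    echo "t0=$t0 t1=$EPOCHREALTIME"    # real difference is ~1.8s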

github-actions[bot] commented 2 months ago

A friendly reminder that this issue had no activity for 30 days.