pkg/machine/e2e: fix broken cleanup

Luap99 commented 4 days ago

pkg/machine/e2e: use tmp file for connections

On linux and macos the connections are stored under the home dir by default so it is not a problem there but on windows we first check the APPDATA env and use this dir as config storage. This has the problem that it is not cleaned up after each test as such connections might leak into the following test causing failues there.

Fixes https://github.com/containers/podman/issues/22844

pkg/machine/e2e: fix broken cleanup

Currently all podman machine rm errors in AfterEach were ignored. This means some leaked and caused issues later on, see https://github.com/containers/podman/issues/22844.

To fix it first rework the logic to only remove machines when needed at the place were they are created using DeferCleanup(), however DeferCleanup() does not work well together with AfterEach() as it always run AfterEach() before DeferCleanup(). As AfterEach() deletes the dir the podman machine rm call can not be done afterwards.

As such migrate all cleanup to use DeferCleanup() and while I have to touch this fix the code to remove the per file duplciation and define the setup/cleanup once in the global scope.

Does this PR introduce a user-facing change?

None

Luap99 commented 4 days ago

@edsantiago @baude @ashley-cui PTAL

openshift-ci[bot] commented 4 days ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/containers/podman/blob/main/OWNERS)~~ [Luap99,edsantiago] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

edsantiago commented 4 days ago

LGTM

ashley-cui commented 3 days ago

/lgtm

Luap99 commented 3 days ago

I think I noticed one weird failure pattern:

$ podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo1
  [FAILED] Timed out after 240.001s.
...

-> next test
$ podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw f357ac67e822
  Error: truncate /private/tmp/ci/podman_test9067091/.local/share/containers/podman/machine/applehv/foo1-arm64.raw: no such file or directory
  Machine init complete
  To start your machine run:

    podman machine start f357ac67e822
-> this one is a success despite the error message?! And notice how the error path contains the machine name from the previous failed test.

I see this pattern in basically all my failed runs here.

My best guess is that was caused by https://github.com/containers/podman/pull/23068. I know we had the flake before but the fact that it got that bad all of the sudden suggest to me that something must have changed that causes this. Looking at the runs there it took 7 tries: https://cirrus-ci.com/task/5748607108775936

I also pushed https://github.com/containers/podman/pull/23162 that should hopefully add useful debug output to find the root cause.

Luap99 commented 3 days ago

It took 13 tries to get the mac machine test to pass

containers / podman

pkg/machine/e2e: fix broken cleanup #23154

Does this PR introduce a user-facing change?