confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

tests/e2e: Libvirt Env tests are unstable #1831

Open stevenhorsman opened 2 months ago

stevenhorsman commented 2 months ago

We see occasional failures (anecdotally <20% of the time) on the libvirt nightly CI. So far they have always passed on re-run, but we've now seen one on a PR test, so it's becoming more of an obstacle and we should investigate it when we get the chance.

=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly/EnvVariablePeerPodWithImageOnly_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly (600.10s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly/EnvVariablePeerPodWithImageOnly_test (600.10s)
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment/EnvVariablePeerPodWithBoth_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment (600.04s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment/EnvVariablePeerPodWithBoth_test (600.04s)
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly/EnvVariablePeerPodWithDeploymentOnly_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly (600.06s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly/EnvVariablePeerPodWithDeploymentOnly_test (600.06s)

stevenhorsman commented 1 month ago

This is getting worse and we are hitting it multiple times on each PR now. I've tried running this test locally and in about 8 re-runs it worked every time, so I'm not sure of the cause of the failure. In the short term I think we need to skip it in the CI to stop it blocking PRs.
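
As a rough illustration of the short-term skip, here is a minimal sketch of gating tests on a skip list. It is not the project's actual mechanism: the `E2E_SKIP_TESTS` variable name and the `shouldSkip` helper are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// skipListEnv is a hypothetical variable name used only in this sketch;
// it is not defined by cloud-api-adaptor.
const skipListEnv = "E2E_SKIP_TESTS"

// shouldSkip reports whether testName appears in a comma-separated skip list.
// Entries are compared exactly after trimming surrounding whitespace.
func shouldSkip(testName, skipList string) bool {
	if testName == "" {
		return false
	}
	for _, entry := range strings.Split(skipList, ",") {
		if strings.TrimSpace(entry) == testName {
			return true
		}
	}
	return false
}

func main() {
	name := "TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly"
	if shouldSkip(name, os.Getenv(skipListEnv)) {
		fmt.Println("skip:", name)
		return
	}
	fmt.Println("run:", name)
}
```

Inside a real test function the same check would sit at the top and call `t.Skip` with a pointer back to this issue, so the skip stays visible in the test output instead of silently disappearing.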

stevenhorsman commented 3 weeks ago

It is possible that this is related to the image-pull changes, as Chengyu is touching the config merge code in https://github.com/kata-containers/kata-containers/pull/9695, so after that we should try re-testing this.

stevenhorsman commented 2 weeks ago

Hmm, this is suspicious: now that the e2e tests related to env are skipped, I've seen:

=== RUN   TestLibvirtCreatePeerPodAndCheckWorkDirLogs
=== RUN   TestLibvirtCreatePeerPodAndCheckWorkDirLogs/WorkDirPeerPod_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckWorkDirLogs (600.16s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckWorkDirLogs/WorkDirPeerPod_test (600.16s)

start failing, so maybe it's related to something earlier not being cleaned up, or the workdir test has the same issue?