[RFC] Consider Alternate CI for running Podman tests

YJDoc2 commented 1 week ago

Background

Currently we have a CI for running podman e2e tests using Youki as the runtime. This is intended to serve as a conformance test to check that Youki works correctly with podman. We only run the tests with sudo, so this does not test rootless behavior. Currently the tests are not all passing with Youki; details are below.

Motivation

The tests which are currently failing fall into 3 categories:

1. Tests failing because Youki has a different or incorrect implementation. For example, a test fails because the error message given by Youki is not in the format expected by the test. These kinds of tests can be fixed, and should be fixed so that Youki can be used with podman.
2. Tests failing because of missing configuration. Some tests depend on certain config files, such as ssh certs, being present. The podman repo injects these via secrets, and we cannot provide them, hence these tests will always fail. Thankfully, most of these tests do not test things related to running containers, so we can safely ignore them.
3. Tests failing because of incorrect configuration. GitHub CI configures the CI VM in such a way that we cannot override certain config needed for the tests. If you run these tests in the vagrant VM, they pass. There are also other issues: the bash and bats versions in CI are older, hence some tests do not run correctly at all.

Thus, in order to fix the 1st and 3rd categories, I am proposing to run the podman tests CI on some other CI provider, such as Cirrus CI or Circle CI. Both provide a free tier with credits for public OSS repos and, unlike GitHub, provide a VM setup that we have complete control over. Note that podman itself uses Cirrus CI.

I am NOT proposing to move any other CI to these, as that is not needed, and does not make any sense.

Considerations

Some things I have considered:

The average CI runtime for the podman tests is 40 min. However, because the current CI uses Ubuntu, we need to compile some deps from source, such as netavark and aardvark-dns, which takes about 1-2 minutes. After fixing the bash and bats versions, I expect a few more minutes to be shaved off. Assuming ~35 minutes per run and one run per day, we need ~35*31 = 1085 minutes of credits per month.
Cirrus CI provides 10,000 cpu-minutes on Linux VMs per month (https://cirrus-ci.org/faq/#are-there-any-limits). Note that the CPU here can be faster, so the tests might actually run a bit faster, or could potentially be parallelized.
Circle CI provides 6000 build-minutes per month in the free tier, but their Linux VM consumes 10 credits per minute. I'm not sure what the relation between build-minutes and credits is, so this may or may not be viable.
We will need to figure out how to report test failures. Currently we do not report results, as we know the tests are failing. But once we make them pass, we will need a way to report any failing tests, as the CI won't be running on GitHub CI. My current plan is to have a script, with a GitHub token passed in as a secret, which opens a new issue, or comments on an existing open issue, with the count and names of the failing tests.
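The reporting plan above could look roughly like the sketch below. It is only an illustration: the label name, the issue title and the helper names are hypothetical, and the repository is assumed to be containers/youki. It uses the GitHub REST API endpoints for listing issues, creating an issue and creating an issue comment.

```python
import json
import urllib.request

API = "https://api.github.com"
REPO = "containers/youki"      # assumption: repository that receives the report
LABEL = "podman-ci-failure"    # hypothetical label used to find the tracking issue


def build_report(failing):
    """Format the count and names of the failing tests as a Markdown body."""
    lines = [f"{len(failing)} podman test(s) failing:", ""]
    lines += [f"- {name}" for name in failing]
    return "\n".join(lines)


def plan_request(open_issues, failing):
    """Choose between commenting on an existing open issue and opening a new one.

    Returns the (url, payload) pair for the POST request.
    """
    body = build_report(failing)
    if open_issues:
        number = open_issues[0]["number"]
        return f"{API}/repos/{REPO}/issues/{number}/comments", {"body": body}
    return f"{API}/repos/{REPO}/issues", {
        "title": "Podman tests failing in external CI",  # hypothetical title
        "body": body,
        "labels": [LABEL],
    }


def github(url, token, payload=None):
    """Send a GET (no payload) or a POST (with payload) to the GitHub REST API."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def report(failing, token):
    """Look up the open tracking issue (if any) and post the report."""
    query = f"{API}/repos/{REPO}/issues?state=open&labels={LABEL}"
    url, payload = plan_request(github(query, token), failing)
    return github(url, token, payload)
```

In the CI job this would be called as `report(failing_names, token)`, with the token read from a secret environment variable. Keeping `plan_request` free of network calls means the issue-or-comment decision can be tested without a token.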

If there are no issues with moving the test CI to another provider, I can test both providers on my fork, and we can decide which one to finalize based on that.

cc: @containers/youki-maintainers

yihuaf commented 5 days ago

> Tests failing because Youki has a different or incorrect implementation. For example, a test fails because the error message given by Youki is not in the format expected by the test. These kinds of tests can be fixed, and should be fixed so that Youki can be used with podman.

These should be fixed as first priority and should be straightforward to fix/understand. Are issues in this category impacted by the CI provider environment? Based on the description in the issue, only the 3rd category requires changing the CI?

> Thus we need ~35*31 = 1085 minutes per month credits.

We can save some more by running Mon - Fri or some similar pattern. I don't think we would lose much coverage if we reduce the nightly test frequency a little.
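As a rough illustration of the saving, here is a small sketch that compares a run every day with a Mon - Fri schedule. It assumes the ~35 minutes per run estimated in the issue; the function names are made up for this example.

```python
import calendar

MINUTES_PER_RUN = 35  # estimate taken from the issue description


def runs_in_month(year, month, weekdays_only=False):
    """Count the runs in a month, optionally skipping Saturday and Sunday."""
    days = calendar.monthrange(year, month)[1]
    if not weekdays_only:
        return days
    # calendar.weekday returns 0 for Monday through 6 for Sunday
    return sum(
        1 for day in range(1, days + 1)
        if calendar.weekday(year, month, day) < 5
    )


def monthly_minutes(year, month, weekdays_only=False):
    """Estimated wall-clock CI minutes needed for the given month."""
    return MINUTES_PER_RUN * runs_in_month(year, month, weekdays_only)
```

For January 2024 (31 days, 23 of them weekdays) this gives 1085 minutes for daily runs and 805 minutes for Mon - Fri. These are wall-clock minutes; if a provider bills in CPU-minutes, the number presumably has to be scaled by the CPU count of the VM, which needs checking against the provider's docs.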

> Note that podman itself uses Cirrus CI.

Then we should start exploring the options here.

> We will need to figure out how to report if the tests fail. Currently we do not report results as we know the tests are failing.

How does podman implement this? Can we follow their lead here?

If you can break down the tasks, we can help out on the effort.

utam0k commented 5 days ago

For your information: https://contribute.cncf.io/resources/project-services/hosted-tools/#cicd

YJDoc2 commented 2 days ago

> These should be fixed as first priority and should be straightforward to fix/understand. Are issues in this category impacted by the CI provider environment? Based on the description in the issue, only the 3rd category requires changing the CI?

Hey, so the 3rd category is the reason I opened this RFC, but fixing the env-based failures also allows confirming which tests fail due to the env and which are actual failures. Right now a failing test could be either, and to decide, one needs to run it in a vagrant VM. We also do not have a way to keep checking that a fix added for a test works in CI. Once we are sure that no tests are failing due to env issues, the rest are either config or actual failures, and can be fixed and kept in check via CI.

> We can save some more by running Mon - Fri or some similar pattern. I don't think we would lose much coverage if we reduce the nightly test frequency a little.

Yep! I had not considered this, thanks for pointing this out.

> How does podman implement this? Can we follow their lead here?

Podman runs the CI on each commit/PR, and Cirrus CI has a GitHub app which reports CI results similarly to native GitHub CI. As we don't run the podman tests on PRs, we need a way to explicitly report these failures. Maybe Cirrus/Circle CI itself has an option to report differently and we can use that.

> If you can break down the tasks, we can help out on the effort.

For now, we first need to do a PoC with both, with Cirrus being preferable as podman itself uses it. Once we have a better idea, we port over the test CI. I feel both of these should be done by a single person. Once that is done, we can have a list of failing tests which can be dealt with separately.
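To get that list of failing tests out of a CI run, something like the sketch below might be enough. It assumes the test runner emits TAP output, as bats does, where each result line starts with `ok` or `not ok`; the function name and the sample test names are only illustrative.

```python
import re

# A failing TAP result line looks like "not ok 2 description" or
# "not ok 2 - description"; passing lines start with "ok" instead.
NOT_OK = re.compile(r"^not ok\s+\d+\s*(?:-\s*)?(.*)$")


def failing_tests(tap_output):
    """Return the descriptions of all tests reported as 'not ok'."""
    failing = []
    for line in tap_output.splitlines():
        match = NOT_OK.match(line)
        if match:
            failing.append(match.group(1).strip())
    return failing


# Made-up sample output, not taken from a real run
sample = """1..3
ok 1 podman run basic
not ok 2 podman run error message
# (diagnostic lines start with '#')
ok 3 podman ps # skip no network
"""
print(failing_tests(sample))
```

The resulting names could then be compared between the vagrant VM and the CI run to separate env-based failures from actual ones.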

@utam0k : For your information: https://contribute.cncf.io/resources/project-services/hosted-tools/#cicd

Hey, I had seen this, but I feel it will take some time for the decision to be finalized on the CNCF side, and it'd be better if we start on our own and then port over to their infra.
