eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Optimize CI/CD steps #91

Closed lingnoi closed 3 months ago

lingnoi commented 10 months ago

Description

Optimize the execution time of the CI/CD steps, for instance by reusing builds from previous build steps.

Goals

Each CI/CD step is executed only once; for example, the executable for running the utests and stests is built only once.

Final result

Summary

Instead of using a cache and sequential builds that benefit from each other and from the cache, the pipeline was restructured to use more parallelization, which saves much more time. The cache cleanup time is unpredictable and sometimes very high, in the worst case leading to pull request verifications of more than 11 min. With parallel jobs the pipeline runs more stably, in approx. 4 min. Introducing a cache is only a micro-optimization because our dependencies / own code ratio is not very high and the overall pipeline runtime is essentially determined by the system tests, which have the longest runtime. So, keeping the job that executes the system tests as minimal as possible (build + system test execution) is the key to success at the project's current point in time and state. Any other build job, which runs shorter anyway, is negligible.

Tasks

inf17101 commented 10 months ago

I have seen that the post actions sometimes take a lot of time, for example the rust cache. See this example run: image

It takes 2 min.

krucod3 commented 9 months ago

For PRs we should only run per commit:

On main all actions should run incl. deb packages.

inf17101 commented 9 months ago

I think it would also be great to enable cargo clippy, which performs extra code checks to catch common mistakes and avoid code smells. We shall make the CI/CD pipeline fail if clippy detects something. I have seen that it is common to enable clippy in CI/CD as well.

inf17101 commented 9 months ago

As discussed with @krucod3: in addition, we shall mark long-running system tests and run them only in a nightly pipeline run, because in #67 an stest might be added that takes longer, and there may be more such tests in the future (for example, tests of restart policy behaviors and so on).

inf17101 commented 4 months ago

For better readability and a single point of change, I wanted to extract the conditions that check whether certain steps shall be skipped into an environment variable and check the value of that env variable wherever needed. That way we can give the check a more expressive name that says what it does.

According to the GH Actions docs, the env context is available in steps and jobs, ... . But when inserting the if:... to check the env variable inside a job condition, there is a bug: it reports that .env is not known in the job: https://github.com/eclipse-ankaios/ankaios/actions/runs/9114018663

image

Several issues have been opened for this, and the workarounds lead to more scripting: https://github.com/actions/runner/issues/2372

When the env variable check is inserted into a step condition, it works.

For that reason, I have to copy the conditional expressions to every step and job that shall be skipped, otherwise it would not be consistent. But this is not nice... :-(

inf17101 commented 4 months ago

We can also think about restructuring the workflows a little, so that things that belong together are grouped in the same workflow.

For example:

.github/workflows:

pr_validation.yml
merge_validation.yml
release.yml
documentation.yml

Then merge_validation.yml and release.yml can reuse the pr_validation.yml workflow. But as long as we keep the rust-cache, we must initialize the cache in each separate job that uses the Rust build artifacts, which is overhead.

I am currently testing whether the cache really saves time or is just overhead for the project. But the free GitHub Actions resources in the cloud do not deliver reliable measurements.

inf17101 commented 3 months ago

I have tested several different CI/CD setups and summarized the results in the following visualization:

image

Currently, our build job uses the rust-cache and runs all the steps sequentially. In Rust, different builds can benefit from previous ones, e.g. a cargo clippy can benefit from a previous cargo build or cargo test and run in less time. However, in our project the most time-consuming pipeline task is running the system tests. The test run alone, without the build, takes approx. 2.5 min, and potentially more in the future because of more system tests.

This means that if we put the system test execution into a sequence with other steps (building the unit tests first, cargo clippy and so on), the execution times add up, because the steps are executed one after the other. In addition, the rust-cache saves only a few seconds per build (approx. 9 - 15 seconds). However, the cache setup time is approx. 46 sec, and if the cache content changes a lot through the new build and some old artifacts must be cleaned up, the cache cleanup takes ~4 minutes in the worst case. So the cache is not efficient at the moment, possibly because our dependency / own code ratio is too low (as mentioned in the rust-cache docs https://github.com/Swatinem/rust-cache?tab=readme-ov-file#cache-effectiveness). Also, the cache compresses the cargo home directory and the target directory with tar and uploads the archive into the GH Actions cache. However, the cache is limited to 10 GB on GitHub Actions, and considering how big one cache object is (approx. 1.5 - 2 GB), after a few PRs the cache is useless, because it fills up very quickly and every run gets a cache miss. I also tried filling the cache only on push to main and using it in the PRs to build only the differences to main, but the benefit was not high either: we save a few seconds of build time (20 seconds), but the cache setup time was around 46 sec and the cache cleanup also takes time when a large amount of code changes.
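The cache trade-off described above can be checked with simple arithmetic. The following is only a rough sketch built from the approximate numbers in this comment; the function names are illustrative, and the number of builds per job that profit from the cache is an assumption that has to be supplied by the caller.

```python
# Rough cost/benefit estimate for the rust-cache, using the approximate
# numbers quoted in this comment. All names are illustrative.

CACHE_SETUP_S = 46     # approx. cache setup time per job
CACHE_LIMIT_GB = 10    # cache limit on GitHub Actions


def net_saving_s(cached_builds, saving_per_build_s, cleanup_s=0):
    """Seconds saved by the cache in one job; negative means it costs time."""
    return cached_builds * saving_per_build_s - CACHE_SETUP_S - cleanup_s


def builds_to_break_even(saving_per_build_s):
    """Smallest number of cached builds that pays back the setup time."""
    return -(-CACHE_SETUP_S // saving_per_build_s)  # ceiling division


def entries_that_fit(entry_gb):
    """How many cache objects of the given size fit into the cache limit."""
    return int(CACHE_LIMIT_GB // entry_gb)


if __name__ == "__main__":
    # Assumption: three builds in one job profit from the cache.
    print(net_saving_s(3, 15))                 # best-case saving per build
    print(net_saving_s(3, 9, cleanup_s=240))   # worst case incl. ~4 min cleanup
    print(builds_to_break_even(15), builds_to_break_even(9))
    print(entries_that_fit(2.0), entries_that_fit(1.5))
```

With a saving of 9 - 15 seconds per build, a job would need roughly four to six cached builds just to pay back the setup time, and only about five or six cache objects fit into the 10 GB limit.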

The key here is not to use sequential steps in one GH Actions job, but to use the parallelization of jobs inside GH Actions workflows to speed things up. Our pipeline runtime is mainly driven by the execution time of the system tests, so the system tests must run as their own job, freed from previous build steps. The system tests require only a preceding debug build, so that build must be put into the same job to be able to run the system tests. The remaining pipeline steps can be distributed across other GH Actions jobs, and it does not matter if we download the dependencies and repeat build steps there, because avoiding that does not save a lot of time in the project currently. The cost of building everything anew in each separate job is small compared to the long runner (system tests).
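The effect of moving from sequential steps to parallel jobs can be illustrated with a small model: steps inside one job add up, while the workflow as a whole is bounded by its slowest job. This is a simplified sketch; only the ~150 s system test run comes from this comment, every other duration and all names are hypothetical placeholders, and per-job startup overhead is ignored.

```python
# Sequential steps add up; parallel jobs are bounded by the slowest job.
# Only the ~150 s system test run is taken from this comment; every other
# duration below is a hypothetical placeholder.


def sequential_runtime_s(step_durations_s):
    """One job: the steps are executed one after the other."""
    return sum(step_durations_s)


def parallel_runtime_s(jobs):
    """Independent jobs run side by side; each job runs its steps in order."""
    return max(sum(steps) for steps in jobs.values())


DEBUG_BUILD_S = 60     # hypothetical
SYSTEM_TESTS_S = 150   # approx. 2.5 min
UNIT_TESTS_S = 70      # hypothetical
CLIPPY_S = 40          # hypothetical

all_in_one_job = [DEBUG_BUILD_S, SYSTEM_TESTS_S, UNIT_TESTS_S, CLIPPY_S]
split_into_jobs = {
    "system_tests": [DEBUG_BUILD_S, SYSTEM_TESTS_S],
    "unit_tests": [UNIT_TESTS_S],
    "clippy": [CLIPPY_S],
}

if __name__ == "__main__":
    print(sequential_runtime_s(all_in_one_job))
    print(parallel_runtime_s(split_into_jobs))
```

In this model the parallel runtime is exactly debug build + system test run, no matter how many shorter jobs are added next to it.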

Summary:

Using the rust-cache in our project is for now only a micro-optimization. Parallelization is more efficient here. Before the pipeline refactoring, using cache + sequential build steps inside one job: best case ~6.x minutes, worst case ~11.x minutes for the whole pipeline run.

After the pipeline refactoring, using parallelization and no cache: best case ~4.x minutes, worst case the time for debug build + system test run.

Here are some PRs with the times of the old pipeline setup (just a few recent PRs; we cannot consider all previous PRs and calculate the average, because there were fewer system tests in the past, and the system tests are what essentially drives the time upwards):

https://github.com/eclipse-ankaios/ankaios/actions/runs/9127799150 | ~ 6 min
https://github.com/eclipse-ankaios/ankaios/actions/runs/9127901595 | ~ 6 min
https://github.com/eclipse-ankaios/ankaios/actions/runs/9001513093 | ~ 6 min
https://github.com/eclipse-ankaios/ankaios/actions/runs/8981237379 | ~ 6 min
https://github.com/eclipse-ankaios/ankaios/actions/runs/8814055638 | ~ 6 min
https://github.com/eclipse-ankaios/ankaios/actions/runs/9128542234 | 11 min

And here is an Action run for the new pipeline setup: https://github.com/eclipse-ankaios/ankaios/actions/runs/9187866027 => 4 min 50 seconds

Overall just through parallelization we save approx 1.5 - 1.8 minutes.

I am still working on another optimization: excluding some long-running system tests from PR runs, as mentioned in some comments above, and running them only on merge into main or nightly. We can do that by defining meaningful tags inside the Robot Framework tests and then excluding / including tests depending on the tag. I have implemented that and saved an additional 40 sec of runtime by excluding one "retry..." stest so that it runs only on merge into main.
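The tag-based selection could look roughly like the following sketch, which builds the robot command line depending on the GitHub event that triggered the run. It relies on the `--exclude` option of the Robot Framework command line; the tag name, the test directory and the function name are hypothetical and not taken from the actual pipeline.

```python
# Build the Robot Framework command line depending on the pipeline trigger.
# The tag name and the test directory are hypothetical placeholders.
from typing import List

LONG_RUNNING_TAG = "long_running"  # hypothetical tag name
TEST_DIR = "tests"                 # hypothetical test directory


def robot_command(event_name: str) -> List[str]:
    """Return the robot invocation for the given GitHub event name."""
    command = ["robot"]
    if event_name == "pull_request":
        # Skip tests carrying the tag during pull request verification.
        command += ["--exclude", LONG_RUNNING_TAG]
    command.append(TEST_DIR)
    return command


if __name__ == "__main__":
    print(" ".join(robot_command("pull_request")))
    print(" ".join(robot_command("push")))
```

Keeping the decision in one place means the tagged tests still run unchanged on every trigger other than a pull request.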

I reduced the time now from ~6.5 minutes to ~ 4 min.

In addition, I am currently thinking about sccache (https://github.com/mozilla/sccache), which is a compiler cache, but that requires building a new devcontainer because a 'cargo install' is needed. I think adding more cargo install commands makes the container build heavier, so this is more effort to test.

inf17101 commented 3 months ago

@krucod3: I have now reduced the time from ~6.5 minutes to ~ 4 min, see the detailed explanation in the comment above. After removing the cache and splitting the pipeline steps for more parallelization, I am wondering whether we shall enable all the pipeline checks again. All disabled pipeline steps like Code Coverage, Debian packages, ... have a significantly lower runtime than the system test execution step, so these builds do not degrade the overall runtime of the pipeline. Enabling them would only have advantages and no disadvantages. I will now enable them again, because the pipeline runtime is not affected and more checks bring more benefit. Please let me know if you see it differently and want me to turn them off again.

Last result of https://github.com/eclipse-ankaios/ankaios/actions/runs/9220440597: image

I think we can save even more time with a smaller container image, because downloading and initializing it takes about 1 minute every time.

But I keep this for now and wait for your feedback.

We also need to open a PR to change the required status checks in the self service, because the pipeline structure has changed. I have opened one: https://github.com/eclipse-ankaios/.eclipsefdn/pull/5

krucod3 commented 3 months ago

@inf17101, parallelizing the steps sounds like a good idea and indeed means we can enable more checks when they run beside the others 👍 As for the smaller dev container, it would definitely be great to do it, but it is outside of the scope here so we will tackle it later.

inf17101 commented 3 months ago

I added a user doc and a fix for the condition that checks whether long-running stests shall be skipped on pull request verification. I tested a final release build in my fork and it worked.