Closed: lingnoi closed this issue 3 months ago
I have seen that the post actions sometimes take a lot of time, for example the rust-cache. See this run as an example:
It takes 2 min.
For PRs, we should only run per commit:
On main, all actions should run, incl. the deb packages.
I think it would also be great to enable cargo clippy,
which performs extra code checks to catch common mistakes and avoid code smells. We shall make the CI/CD pipeline fail if clippy detects something. As far as I have seen, it is common practice to enable clippy in CI/CD as well.
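A minimal sketch of such a CI step could look like the following (the job name is illustrative, not the project's actual workflow; `-D warnings` is the usual way to turn every clippy finding into a hard error so the job fails):

```yaml
# Hypothetical clippy job sketch for a GH Actions workflow.
clippy:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Run clippy
      # -D warnings promotes all clippy warnings to errors,
      # so any finding fails this job and therefore the pipeline.
      run: cargo clippy --all-targets --all-features -- -D warnings
```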
As discussed with @krucod3: in addition, we shall mark long-running system tests and run them only in a nightly pipeline run, because in #67 an stest might be added that takes longer, and there may be more such tests in the future (for example, testing restart policy behaviors and so on).
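A sketch of how the nightly trigger could look (the cron time is an assumption; how the long-running marker is passed to the test runner depends on the test framework and is not shown here):

```yaml
# Hypothetical nightly workflow trigger; the schedule is an example value.
on:
  schedule:
    # Run the full test suite, incl. tests marked as long-running,
    # every night at 02:00 UTC.
    - cron: "0 2 * * *"
```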
For better readability and a single point of change, I wanted to extract the conditions that check whether certain steps shall be skipped into an environment variable and check the value of that env variable wherever needed. Then we can give the check a more expressive name that says what it does.
According to the GH Actions docs, the env context is available in steps and jobs. But when inserting an if: condition
that checks the env variable at the job level, a bug appears, saying that env is not known in job: https://github.com/eclipse-ankaios/ankaios/actions/runs/9114018663
There are several issues open for this, and the workarounds lead to more scripting: https://github.com/actions/runner/issues/2372
When the env variable check is inserted into a step condition, it works.
For that reason, I would have to copy the conditional expression to every step and job that shall be skipped, otherwise it would not be consistent. But this is not nice... :-(
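A condensed sketch of the behavior described above (the variable name is made up for illustration):

```yaml
# Hypothetical workflow fragment illustrating where the env context
# can and cannot be used in if: conditions.
env:
  RUN_EXTENDED_CHECKS: ${{ github.ref == 'refs/heads/main' }}

jobs:
  build:
    # Does NOT work: the env context cannot be resolved in a job-level
    # if: condition (see actions/runner#2372).
    # if: ${{ env.RUN_EXTENDED_CHECKS == 'true' }}
    runs-on: ubuntu-latest
    steps:
      - name: Extended checks
        # Works: the env context IS available in step-level if: conditions,
        # but the expression must then be repeated in every affected step.
        if: ${{ env.RUN_EXTENDED_CHECKS == 'true' }}
        run: echo "running extended checks"
```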
We could also think about restructuring the workflows a little bit, to group things together that belong together.
For example:
.github/workflows:
pr_validation.yml
merge_validation.yml
release.yml
documentation.yml
Then merge_validation.yml and release.yml can reuse the pr_validation.yml workflow. But as long as we keep the rust-cache, we must initialize the cache in each separate job that uses the Rust build artifacts, which is overhead.
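The reuse could be done via GH Actions reusable workflows; a sketch under the file layout proposed above (trigger details are assumptions):

```yaml
# pr_validation.yml -- add workflow_call so other workflows can reuse it.
on:
  pull_request:
  workflow_call:

# --- separate file: merge_validation.yml -----------------------------
# Reuses the PR validation jobs on pushes to main.
on:
  push:
    branches: [main]
jobs:
  pr-checks:
    uses: ./.github/workflows/pr_validation.yml
```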
I am currently testing whether the cache really saves time or is just overhead for the project. But the free GitHub Actions cloud resources do not deliver reliable measurements.
I have tested several different CI/CD setups and summarized the results in a visualization:
Currently, our build job uses the rust-cache and runs all tasks sequentially. In Rust, different builds can benefit from previous ones; e.g., cargo clippy can reuse work from a previous cargo build or cargo test and thus run faster. However, in our project the main time-consuming pipeline task is running the system tests. The run alone, without the build, takes approx. 2.5 min, and potentially more in the future because of additional system tests.
This means that if we put the system test execution into a sequence with other steps (building the unit tests beforehand, cargo clippy, and all that stuff...), the execution times add up, because each step runs one after the other. In addition, the rust-cache saves only a few seconds per build (approx. 9 - 15 seconds), while the cache setup takes approx. 46 sec; and if the cache content changes a lot through a new build and old artifacts must be cleaned up, the cache cleanup can take ~4 minutes in the worst case. So the cache is not efficient at the moment, maybe because our dependency-to-own-code ratio is too low (as mentioned in the rust-cache docs: https://github.com/Swatinem/rust-cache?tab=readme-ov-file#cache-effectiveness).

The cache also compresses the cargo home directory and the target directory with tar and uploads the archive to the GH Actions cache. However, the cache is limited to 10 GB on GitHub Actions, and given how big a single cache object is (approx. 1.5 - 2 GB), a few PRs render the cache completely useless: it fills up very quickly and everything gets a cache miss. I also tried filling the cache only on pushes to main and using it in PRs to build only the differences to main, but the benefit was not high either: we save a few seconds of build time (~20 seconds), but the cache setup still takes around 46 sec, and the cache cleanup also takes time on large code changes.
The key here is not to use sequential steps within one GH Actions job, but to use the parallelization mechanism of jobs inside GH Actions workflows to speed things up. Our pipeline runtime is mainly driven by the execution time of the system tests, so the system tests must run as their own job, freed from previous build steps. The system tests only require a debug build beforehand, so those two must be put together to be able to run the system tests. The rest of the pipeline steps can be distributed across other GH Actions jobs, and it does not matter if we download the dependencies there and repeat build steps, because that currently does not cost much time in the project. The cost of building everything from scratch in each separate job is low compared to the long runner (the system tests).
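A rough sketch of that job layout (job names and the system test command are illustrative, not the project's actual workflow):

```yaml
# Hypothetical parallel job layout; all jobs run concurrently.
jobs:
  # Minimal critical path: debug build + system tests in ONE job,
  # so the longest runner is not delayed by unrelated steps.
  system-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build              # debug build only
      - run: ./run_system_tests.sh    # hypothetical script name

  # Everything else goes into separate parallel jobs; repeating the
  # build here is cheaper than waiting in a shared sequential job.
  clippy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo clippy --all-targets -- -D warnings

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test
```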
Summary:
Using the rust-cache in our project is, for now, only a micro-optimization; parallelization is more effective here. Before the pipeline refactoring (cache + sequential build steps inside one job): best case ~6.x minutes, worst case ~11.x min for a whole pipeline run.
After the pipeline refactoring (parallelization, no cache): best case ~4.x minutes; worst case: time for the debug build + system test run.
Here are some PRs with times from the old pipeline setup (just a few recent PRs; we cannot consider all previous PRs and compute the average, because there were fewer system tests in the past, and those are essentially what drives the time upwards):
Description
Optimize the execution time of the CI/CD steps by, for instance, reusing builds from previous build steps, etc.
Goals
Each CI/CD step is executed only once; for example, the executable for running utests and stests is built only once.
Final result
Summary
Instead of using a cache solution with sequential builds that benefit from each other and from the cache, the pipeline was restructured to use more parallelization, which saves much more time. The cache cleanup time at the end is unpredictable and sometimes very high, in the worst case leading to pull request verifications of 11+ min. With parallelized jobs, the pipeline runs more stably in approx. 4 min. Introducing a cache is only a micro-optimization, because our dependencies-to-own-code ratio is not very high, and the main pipeline runtime is essentially determined by the system tests as the longest-running part. So keeping the job that executes the system tests as minimal as possible (build + system test execution) is the key to success at the project's current point in time and state. Any other build job that runs shorter anyway is negligible.
Tasks