[CI] Minimal jobs to check API on pull requests

maximusron commented 3 months ago

The current CI has an extensive suite of jobs covering every permutation of osx, windows, linux with both clang-repl and cling following every tool that uses CppInterOp. This A) takes up a lot of resources and B) are not really necessary to validate most pull requests changing the InterOp API.

I would propose segmenting a minimal workflow that have 3 sets of jobs each for windows, os x and ubuntu that run the InterOp build and test on the clang 16-18 and only run tools like the xeus products, cppyy and WASM once that is verified. Extensive build coverage can also be migrated to the repos of the tools themselves. Essentially it would be good if we don't tag every job with pull_request but have a set that would serve as a sanity check.

The other jobs that cover every permutation can be ran post merge to check main. This can speedup workflows especially when features are being developed in parallel

@vgvassilev @mcbarton would this be something we can do?

maximusron commented 3 months ago

Not sure if this is the case but it seems like each job runner is building LLVM. I was wondering if we can also merge runners that use the same LLVM clang-repl/cling build to test multiple tools if that can make the jobs more efficient. That might make debugging failures harder, but is there any other way we can improve this config?

mcbarton commented 3 months ago

Reducing the extensiveness of the CI shouldn't be done in my opinion. It may seem silly to have an extensive PR to test small changes, but small changes can have unexpected consequences, and the CI can help catch them before they end up in the main.

The main reason I see for the PRs being slow when having multiple PRs is due to limitations of the caching system and number of Github runners which can be used simultaneously accross the organisation.

The llvm cache has to be created for each PR, and cannot be shared with the main branch or between PRs. This will overwhelm the cache storage limit even with the number of jobs reduced. An alternative to the cache of llvm builds may be to custom VMs to run the workflows on. See the beta feature here

https://github.com/github/roadmap/issues/826

I don't see a solution to the number of simultaneous runners, but the custom vm option once available should allow for a dramatic speed up in PR run time.

vgvassilev commented 2 months ago

I believe the cache in main can be reused. I think for llvm17 we can just take the binary package as we do not have patches on top iirc. The same for llvm18...

mcbarton commented 2 months ago

We apply a patch for llvm 17 so I don't believe we can use the binary packages for this version. llvm 18 requires no patches, so a binary package could be used once a web assembley build is available (don't believe this is currently available on emscripten forge though).

@vgvassilev can you point me to an example reusing the cache from the main branch? I will then modify the ci based on this. This change should alleviate the problem for now.

vgvassilev commented 2 months ago

Other ci builds give priority on the cache from the main branch and the PRs do not need to build its own llvm caches. That probably means that the caches of main fit in the 10GB limits...

I do not think the current situation should stay because it practically slows down the development by days and @maximusron's bandwidth in that project will grow...

@alexander-penev, do you have an idea what we can do here?

mcbarton commented 2 months ago

Reusing the main branch cache and using binary packages for llvm 18 should bring the cache under 10GB, and speed up PR runs considerably. I think the name is this issue should be renamed since its about solving the caching issue rather than reducing the number of jobs.

vgvassilev commented 2 months ago

Ok, fair point, we need to figure out how to address it in a smart way...

maximusron commented 2 months ago

I understand we want to keep the coverage the same but I still stand by some reorganisation to make it easier to go through the logs. We want a set of jobs that are somewhat primary like just the CppInterOp builds on all clang versions. The passing of this workflow can be made the condition to run other sections for just cppyy, xeus (similar to how this PR #223 makes the runs smarter)

Something like this will definitely help because there can be developers working only on CppInterOp. Cppyy developers will need to look at InterOp + cppyy and xeus developers would look at InterOp + xeus.

mcbarton commented 1 month ago

@maximusron do you agree this issue of the ci taking too long has been fixed now?

maximusron commented 1 month ago

Closing as this is fixed by #229 #227 #249 #270 and probably a few other PR's here and there.

compiler-research / CppInterOp

[CI] Minimal jobs to check API on pull requests #221