OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Feedback wanted about limiting duplicated CI runs #9851

Closed: echoix closed 2 months ago

echoix commented 2 months ago

Feature description

I want to hear about the use cases here before making the PR I intend to file, to better understand which of the two options I should choose.

Additional context

I observed that some jobs run on both push and pull_request for the same commit, like the PRs made by the backport mechanism. Since there are a lot of long workflows here, and only 20 runners for the whole OSGeo organization, waiting 3-4 hours for CI to start isn't uncommon, so avoiding jobs, and even saving 30 secs to 1 min here and there, counts on contribution-heavy days (and we want to keep contributions).

I want to tighten the branch filtering of workflows on push events, so that they only run for pushes to the special branches, like master and the release branches (the ones that will have the updated CI workflows). But to know whether that fits the reality here, I want to know:

  • Are there any other (active or future) branches in the repo that need to have workflows run on pushes (not PRs)?
  • In forks, especially the ones of the major maintainers (like Even Rouault), do you make a PR to your fork before creating the one that will run against the base repo? Or do you exclusively rely on the CI to run on every push, not only when you are ready for it?

Depending on the results of these questions (especially the second one), I can choose one of two options:

  1. The simple one: limit workflows that run on pushes to the master and release/* branches. Thus, dependabot/* and backport-(PR)-to-release/* would not be run uselessly. This also applies to forks, so forks would need to create PRs (against the master branch of their own fork, for example) for their new feature branches to run CI.
  2. A bit more complicated: if we want the same behaviour, but only in the base repo (https://github.com/OSGeo/gdal), then, since on: event triggers don't support filtering by repo and we want fewer restrictions on forks than in the base repo, I would need to add a long if: condition to each job (not each workflow, but the jobs inside the workflows) that reimplements the wanted pattern (restrict pushes to branches when it is the base repo, keep the current behaviour when the repo is not OSGeo/gdal). It wouldn't be as nice and clear, but it should work. The runs would still create an entry in the Actions tab, with an id assigned, but they would show up as skipped, since all jobs would have been skipped at the job level. (A sketch of both options follows this list.)
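
A minimal sketch of what the two options could look like in a workflow file (hedged: the branch patterns, job name, and glob syntax are illustrative, not necessarily what the eventual PR would use):

```yaml
# Option 1: only trigger on pushes to the long-lived branches; every
# other branch runs CI through a pull request instead.
on:
  push:
    branches:
      - master
      - 'release/**'
  pull_request:
```

```yaml
# Option 2: keep the broad push trigger, but skip each job on pushes to
# other branches when running in the base repo; forks keep the current
# behaviour. The run still appears in the Actions tab, marked as skipped.
jobs:
  build:
    if: >-
      github.event_name != 'push' ||
      github.repository != 'OSGeo/gdal' ||
      github.ref == 'refs/heads/master' ||
      startsWith(github.ref, 'refs/heads/release/')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
```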

Currently, while debugging another CI improvement for CodeQL that I want to file a PR for, since I have a PR open against my fork, most workflows run duplicated jobs and use up all the available runners for a single commit (even the workflows for PRs don't manage to run without being queued for a long time).

Otherwise, we could discuss which workflows need to run (or not) on pushes to branches in all repos (the base one and forks), apply solution 1 to some workflows, and solution 2 only to specific ones, if needed.

rouault commented 2 months ago
  • Are there any other (active or future) branches in the repo that need to have workflows run on pushes (not PRs)?

I don't think so

  • In forks, especially the ones of the major maintainers (like Even Rouault), do you make a PR to your fork before creating the one that will run against the base repo? Or do you exclusively rely on the CI to run on every push, not only when you are ready for it?

My typical workflow is split between 2 options:

  • for simple enough changes, I directly issue the pull request against OSGeo/GDAL, without running CI on my fork first
  • for more complex work whose development lasts several days or weeks and where I know that CI will almost certainly be "red", I look at the results of workflows on push events in my rouault/GDAL repo, and when ready issue the final pull request against OSGeo/GDAL (I don't do PRs against my own fork)

  • Thus, dependabot/* and backport-(PR)-to-release/* would not be run uselessly.

I've changed the dependabot frequency to monthly, as the weekly frequency was annoying, so that's not much of a concern. And for "backport-(PR)-to-release/*" we already have a "branches-ignore" item in our workflows, so they don't run on push, but just on pull requests. So I don't think there's much to optimize.

A potential idea for optimization, though I'm not sure it could be done, and there would be subtleties, would be to avoid running workflows on master, once merged, when they have already completed successfully in a pull request against master.

But the subtlety is that some workflows use pre-cached Docker images generated by master builds in the pull requests. This is typically for rolling releases such as alpine:edge or fedora:rawhide, which break upstream regularly due to their unstable nature. So in a master run, we try to update the distro to the latest state of the packages we are interested in. If the update is successful, we refresh a Docker image with those updated packages, and the next pull requests will re-use it. If the update in master fails, then the Docker image isn't updated, and pull requests will still work against a working version of the image. This strategy is, as far as I remember, only used for jobs in linux_build.yml
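
A rough sketch of that strategy, assuming the cache lives in a container registry (the image name and registry are hypothetical; the actual logic is in linux_build.yml):

```yaml
jobs:
  docker-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build against the latest state of the distro packages
        run: docker build -t example.registry/gdal-deps:fedora-rawhide .
      - name: Refresh the cached image
        # Only successful master runs get here; if the build step failed,
        # the push is skipped and PRs keep re-using the last known-good image.
        if: github.event_name == 'push' && github.ref == 'refs/heads/master'
        run: docker push example.registry/gdal-deps:fedora-rawhide
```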

echoix commented 2 months ago
  • for more complex work whose development lasts several days or weeks and where I know that CI will almost certainly be "red", I look at the results of workflows on push events in my rouault/GDAL repo, and when ready issue the final pull request against OSGeo/GDAL (I don't do PRs against my own fork)

For these bigger long-lived feature branches, would it be too big of a disturbance to create a PR when you need checks to run?

  • I've changed the dependabot frequency to monthly, as the weekly frequency was annoying, so that's not much of a concern. And for "backport-(PR)-to-release/*" we already have a "branches-ignore" item in our workflows, so they don't run on push, but just on pull requests. So I don't think there's much to optimize.

I've rechecked just to make sure, since that wasn't what I saw, and yes, most of them have the right exclusions; only CodeQL is missing that exclusion. Also, dependabot PRs create a branch in the repo, and all of the workflows run on both the commit push and the updated PR. I could add the dependabot prefix as an exclusion too.
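
For reference, a minimal sketch of what adding those exclusions to the CodeQL workflow could look like (the glob patterns are illustrative; the exact ones should match what the other workflows already use):

```yaml
on:
  push:
    branches-ignore:
      - 'backport-*'      # backport branches, already excluded elsewhere
      - 'dependabot/**'   # the proposed additional exclusion
  pull_request:
```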

  • A potential idea for optimization, though I'm not sure it could be done, and there would be subtleties, would be to avoid running workflows on master, once merged, when they have already completed successfully in a pull request against master. But the subtlety is that some workflows use pre-cached Docker images generated by master builds in the pull requests. This is typically for rolling releases such as alpine:edge or fedora:rawhide, which break upstream regularly due to their unstable nature. So in a master run, we try to update the distro to the latest state of the packages we are interested in. If the update is successful, we refresh a Docker image with those updated packages, and the next pull requests will re-use it. If the update in master fails, then the Docker image isn't updated, and pull requests will still work against a working version of the image. This strategy is, as far as I remember, only used for jobs in linux_build.yml

That is indeed a bit tricky. I understand exactly what you mean, and I've looked into this a lot over the last years, but the reason it isn't common is one important subtlety.

I was eager to try out the merge queue feature, as I thought it would address this kind of use case. With merge queues, once PRs are ready to be merged, they are added to a merge queue, where CI runs on what the code will be once every other PR ahead of it in the queue is applied to it (on special read-only branches). If one fails, it gets kicked out of the merge queue and placed back as an unmerged PR, and the rest of the PRs in the queue start over. However, merge queue is just another trigger, and there is still a merge/commit of the special branch into master (or wherever the PR was targeted), so it doesn't really eliminate the runs on pushes to these branches (you get even more CI runs, but at least you aren't always waiting, if you pay for enough runners).

It is with this that you can understand the subtlety: even with all CI passing on a PR, further non-conflicting changes on the base branch don't guarantee that CI would still pass. It is the difference between code conflicts and semantic conflicts (for example, one PR renames a function while another, textually conflict-free PR adds a call under the old name: each passes CI on its own, but master breaks once both are merged). You can read about it near the middle of this article: https://blog.mergify.com/merge-conflicts-understanding-difference-between-semantic-and-code-conflicts/

So, the real way to be absolutely sure that master always passes is to make it run all the checks, always. (Or to make sure every PR is up-to-date with master before merging, but that means that once a PR is merged, any other PR that was ready to be merged needs to be updated again, and it ends up always waiting.) Thus, the compromise, keeping all merged PRs effectively tested against master without merging strictly one PR at a time, and without letting semantic conflicts go unnoticed, is to have all CI run on the final push, one after the other.
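
For context, opting a workflow into merge queue CI is just one more trigger on top of the existing ones, which is why it adds runs rather than removing any (a sketch; these workflows don't currently use it):

```yaml
on:
  pull_request:
  # Merge queue runs arrive as a separate event, executed on temporary
  # read-only gh-readonly-queue/* branches, before the final push to master.
  merge_group:
```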

rouault commented 2 months ago

  • For these bigger long-lived feature branches, would it be too big of a disturbance to create a PR when you need checks to run?

Yes, it would be slightly annoying to have to do so (for example, because once you've issued a pull request, it is a bit more awkward in the GitHub UI to create a pull request against another repo, if I remember well), and that would just move the issue to my branch, where I would now have builds on both the push and the pull request...

This discussion also slightly intersects with https://github.com/OSGeo/gdal/issues/7306, which was aimed at something different. But there are a number of situations where, as an experienced contributor, I know for sure that, for simple changes, just running the build & tests on one CI setup (or just a couple of them) is more than enough to validate it, and it's useless to run all the others. And in other cases, when in doubt, running the full CI might be needed. So perhaps a solution would be to have the proposed github.event.label.name == 'extended-ci' condition for pull requests, and let PR reviewers decide whether they must set the label or not.
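
A hedged sketch of that label gating on a job (the 'extended-ci' label name comes from the linked issue; checking contains() on the PR's labels is one way to express it so the condition also holds on later pushes to the PR, not only on the labeled event itself):

```yaml
jobs:
  extended-tests:
    # Run the heavy configurations only when a reviewer has applied the
    # 'extended-ci' label to the pull request.
    if: contains(github.event.pull_request.labels.*.name, 'extended-ci')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... the full build & test matrix would follow here
```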