conditional CI pipeline + migrate some jobs from travis to gh

mbien commented 2 years ago

Now that we have the basic CI features implemented (cancellable, restartable, junit-reports #4328 and a simple pipeline #3785), we can migrate some travis jobs to github.

I don't think we can migrate everything, since apache has a limited number of job runners which are shared between all apache projects - using them all every time someone opens a PR wouldn't be polite :)

so far migrated: sig test, java hints, javafx, enterprise, java modules, profiler, apisupport, extide, platform modules, webcommon, rat, lib validation

travis: 31 -> 14 jobs (~12.5h -> ~5h)
gh: 12 (fixed) -> 25 (conditional) (~4h -> ~3h - 8h) (counts individual matrix axis as job)

The good news is that a lot of time is saved simply by having the jobs as part of the pipeline, the initial work is now paying off, travis did build in every job. Another way to save CI time is by triggering long running jobs/steps only with certain PR labels, see below.

other changes:

travis uses the travis_retry command on many unreliable tests. Gh doesn't have that so I had to find a workaround with the help of a little script. The summary will now also show warnings:

we don't have to hide logs anymore since there doesn't seem to be a log limit. Additionally: logs are split by step so which makes inspection easier
some steps or jobs are now skipped based on selected PR labels
- ci:no-build for disabling the build entirely
- ci:all-tests for running all tests
- Java, JavaScript, Platform and PHP unlock some long running steps/jobs in their respective category
- Ant, Gradle, Maven or MX trigger the build tools job
- JavaDoc and API Change run javadoc jobs
- all jobs/steps run if the workflow was not started by a PR sync, e.g during merges
- the full text search allows to quickly check which labels influence CI, e.g by entering "ci java" or just "ci". tooltips will show the description too:
merged sig tests, rat, lib val into a single job for all paperwork. The job will also print the commit header so that reviewers can check if it looks ok
merged extide, ant, gradle, maven and mx into a cross-cluster build-tools job
moved some steps around, unreliable should run first for fast fails

neilcsmith-net commented 2 years ago

I don't think we can migrate everything since apache has a limited number of job runners which are shared between all apache projects - using them all every time someone opens a PR wouldn't be polite :)

No, but the situation with Travis isn't much different. What about taking this as a chance to try some of the heavier tests on different triggers? A daily basis on certain branches, or on PR merges?

mbien commented 2 years ago

I don't think we can migrate everything since apache has a limited number of job runners which are shared between all apache projects - using them all every time someone opens a PR wouldn't be polite :)

No, but the situation with Travis isn't much different. What about taking this as a chance to try some of the heavier tests on different triggers? A daily basis on certain branches, or on PR merges?

sure thats all no problem in principle. The main issue is to actually split the tests up and decide what should run when. Triggering based on code paths, or even PR labels is also an option, but this has been discarded every time we had the discussion. PHP for example almost became conditionally triggered as first experiment.

But i do agree, this project runs probably ~20h of testing per sync (travis alone is 13h). Thats simply too much. (edit, the pipeline will reduce this a bit since each job won't need to build everything)

neilcsmith-net commented 2 years ago

sure thats all no problem in principle. The main issue is to actually split the tests up and decide what should run when. Triggering based on code paths, or even PR labels is also an option, but this has been discarded every time we had the discussion.

Yes, I know! :exploding_head: I actually meant that as a hint that you might want to propose a suitable split, whatever makes sense to you. Then anyone who wants to -1 it can actually do the work to propose how to move things around to keep our CI usage at a workable (and polite to the rest of ASF) level. :smile:

mbien commented 2 years ago

@neilcsmith-net looks promising so far since some jobs on travis did spend most of the time building. When run as part of the pipeline the time goes down to a few minutes.

I don't really know which tests need a display. I might turn the xserver on for all tests on linux to keep things simple (done).

observations so far:

enterprise cluster tests error on 11 https://github.com/apache/netbeans/runs/7482519727?check_suite_focus=true
java modules tests fail on 11 https://github.com/apache/netbeans/actions/runs/2727064622#summary-7487660686
api support tests fail on 11 https://github.com/apache/netbeans/actions/runs/2727064622#summary-7487660542
~Java Hints batch1 fails, don't know why yet.~ -> requires xserver :)
Java Hints, ExtIde, Profiler, FX etc work on 11, yey!

mbien commented 2 years ago

the gradle test is failing, both locally and on github. I don't know why this test is working on travis. I used the same command lines and tried to replicate everything locally and couldn't find a reason. (even diffed the travis log to a local log)

Testcase: testCompilePreTrusted(org.netbeans.modules.gradle.java.classpath.ClassPathProviderImplTest):  FAILED
Closed project should be at least EVALUATED
junit.framework.AssertionFailedError: Closed project should be at least EVALUATED
    at org.netbeans.modules.gradle.java.classpath.ClassPathProviderImplTest.testCompilePreTrusted(ClassPathProviderImplTest.java:264)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at org.netbeans.junit.NbTestCase.access$200(NbTestCase.java:77)
    at org.netbeans.junit.NbTestCase$2.doSomething(NbTestCase.java:476)
    at org.netbeans.junit.NbTestCase$1Guard.run(NbTestCase.java:402)
    at java.base/java.lang.Thread.run(Thread.java:829)

fixed by #4456

mbien commented 2 years ago

rebased to latest master
made javadoc step conditional (triggered by 'JavaDoc' and 'API Change' labels)
this build will run everything using 'ci:all-tests' to determine the max total (edit: 8.2h) (min total is 3h)

mbien commented 2 years ago

another sync with master to resolve conflicts. Rebasing this PR can be tricky since so much changed in the files - easy to overlook something during merge.

going to add everyone for review since there was almost no feedback on the dev list!

mbien commented 2 years ago

One concern I have is, that the labels are not there, when a PR is created and would need a trigger. Forcing users/reviewers to do a forced push or request one sounds stupid and I remembered, that gitlab had an option to manually trigger CI/CD pipelines and it turns out github has something similar:

https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch

This would allow a reviewer add labels and then trigger a run. What do you think?

thanks for taking a look @matthiasblaesing I can always count on you for reviews :)

I fully agree that this is currently the most annoying aspect of this change, I haven't figured out how to solve this though. Here the issues:

if a workflow is manually triggered via workflow_dispatch it can't access the pr-context since there is none, it would simply run on a branch. So it would see no labels etc, it would also not show up below the PR as checkmark i am afraid (not tested but i can't see how it would)
there are various other triggers available. E.g it would be possible to trigger if someone adds or removes a label on the PR. but this would be overkill. Due to the limitations of the declarative yaml file I don't know how to trigger only on a certain label (I don't think its currently possible), it would run on every add or remove event.

I haven't given up on that. Ideas welcome.

We have still options though:

we could let unlabeled PRs run everything instead of the minimum job count, (ci:all-tests equivalent) would this be a better default? disadvantage: this could make committers lazy and never label anything which would cancel out this effort here
or we could check if it is even a problem in practice: how often does it happen that a new contributor submits a perfect PR which would create the awkward situation of a reviewer having to label the PR and request a sync? another aspect: maybe the default jobs already cover the PR well enough if its a trival change
(I should probably also update the PR template text to mention that PRs should be labeled before pressing create)

neilcsmith-net commented 2 years ago

if a workflow is manually triggered via workflow_dispatch it can't access the pr-context since there is none,

Does re-run all jobs not work after relabelling then?

we could let unlabeled PRs run everything instead of the minimum job count, (ci:all-tests equivalent) would this be a better default?

Maybe we should start with that as default anyway, but stated as time-limited for a while as people get used to it / feed back concerns?

Either way, make sure to do a mail on dev@ with [NOTICE] in the subject when this merges. Then no-one should be complaining they didn't see it! :wink:

mbien commented 2 years ago

if a workflow is manually triggered via workflow_dispatch it can't access the pr-context since there is none,

Does re-run all jobs not work after relabelling then?

unfortunately not, that would have been perfect. The workflow takes a snapshot of the context at the beginning it seems. It doesn't see label updates on restarts.

we could let unlabeled PRs run everything instead of the minimum job count, (ci:all-tests equivalent) would this be a better default?

Maybe we should start with that as default anyway, but stated as time-limited for a while as people get used to it / feed back concerns?

I had the same thought too. Let me put all-tests as default. Makes the transition easier

Either way, make sure to do a mail on dev@ with [NOTICE] in the subject when this merges. Then no-one should be complaining they didn't see it! wink

will do! and probably cc some oracle devs right?

There is something i want to try with workflow_dispatch in a testing repo. But as tom cruise would say: we need two miracles in succession. First the manual dispatch would have to actually appear under the PR if its even possible to run them on PR branches. Second: dispatch can use up to 10 input variables on its form, i could try to map them to labels somehow. I hope you can use them in job conditions since there are some things you can't use: e.g env vars for entirely undocumented reasons.

mbien commented 2 years ago

update: workflow_dispatch can only run on regular branches (e.g master). You can't select PR branches in the UI. So i can skip testing the rest.

neilcsmith-net commented 2 years ago

Does re-run all jobs not work after relabelling then?

unfortunately not, that would have been perfect.

OK, second thought. Does changing the pull request event types to include labeled work? https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request

mbien commented 2 years ago

OK, second thought. Does changing the pull request event types to include labeled work? https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request

this would trigger a new run every time a label is added which is probably too much since we have also the do-not-merge label etc

mbien commented 2 years ago

but what if we abuse reopened for this?

there is also "unlocked" which we will probably also rarely use

neilcsmith-net commented 2 years ago

this would trigger a new run every time a label is added which is probably too much since we have also the do-not-merge label etc

Given this enables many of the tests to eventually not run without required labels, I'm not sure that's a huge issue in practice, and it's certainly more obvious. If the label trigger works, IMO we could also consider going back to minimal tests by default from the start.

mbien commented 2 years ago

this would trigger a new run every time a label is added which is probably too much since we have also the do-not-merge label etc

Given this enables many of the tests to eventually not run without required labels, I'm not sure that's a huge issue in practice, and it's certainly more obvious. If the label trigger works, IMO we could also consider going back to minimal tests by default from the start.

when I read the doc this indeed seemed to be the obvious way how to solve this issue. But this dissipated quickly:

labeled events are fired for each label. So if you put 3 labels on a pr it will start 3 workflows + the open event which makes 4. We have the concurrency rule which would shut down all workflows except the newest, but this would still create a lot of noise on the /actions page
we use do-not-merge/stale/bugfix etc labels for meta-info. This can become a problem when someone accidentally resets a build someone else is looking at
i sometimes go around and add labels to PRs so that they will go the right category on the feature list. Since not all labels are used as conditions, this would make me fixup less since it would cause unnecessary builds just for cosmetics.
when someone would bulk label a bunch of PRs at once via the /pulls page it would simply cause too much building at once IMO

how about using the unlocked event as "secret" trigger for the workflow? I just tested it in a sandbox and it worked well

I don't think it sends mails (I am not sure, difficult to test with just one account, in fact, let me test this by locking this PR, you can tell me if you got a mail :)), close/reopen does
it is less invasive than 'reopened'
maybe one day gh will support better ways of doing this. e.g via adding a custom button to the checks or restarting workflows with a fresh context optionally

I assume that most devs will try to label PRs correctly right away or will sync PRs several times anyway during review which gives enough opportunities to update labels. For new contributors and the event of a PR which does not need any changes, reviewers could use this semi-secret trick to trigger a new build after labeling the PR.

neilcsmith-net commented 2 years ago

No email that I can see. All the options seem pretty ugly.

Is the actual added label in the event? And if so, easy to filter by prefix ci:?

Use gh cli to read the PR labels directly rather than from the event?

Use a comment trigger and filter by text (eg. <RETEST THE DAMN THING>)???

Yep, all ugly. From my perspective, go with what you think makes most sense initially. Can always be tweaked later. Anyway, you're in charge of explaining any secret triggers! :smile:

mbien commented 2 years ago

ready to integrate from my side. Added the "unlocked" trigger and put a info line into the PR template.

I think its better to integrate it as is without squashing everything. It makes it easier to check if for example there was a mistake the migration of certain jobs without having to go through everything.

mbien commented 2 years ago

updated the label descriptions for labels which influence CI. Description text has been prefixed with [ci] which allows filtering.
Added a screenshot with an example to the PR.

full list: https://github.com/apache/netbeans/labels?q=ci

@vieiro I think it is going to be fairly intuitive. Even if you don't label anything there will be still about 3h of CI time spent (maybe we can reduce this further in future). This won't be a fine grained configuration, Existing PRs which update maven features are already labeled with "Maven" right now. Once this is merged, the same label will also toggle maven related tests which are aggregated in a build tools job.

mbien commented 2 years ago

merging. Going to send the [NOTICE] mail to the dev list too as @neilcsmith-net advised.

apache / netbeans

conditional CI pipeline + migrate some jobs from travis to gh #4431