Lightning-AI / pytorch-lightning


[RFC] Make all CI jobs required for Merge #9445

Closed: daniellepintz closed this 2 years ago

daniellepintz commented 3 years ago

Motivation

Currently our CI is red on master, which has the following negative repercussions:

  1. It is confusing to new contributors (like myself) and requires ramp-up time to learn which failures are "normal" and which are not. This may make people less likely to submit PRs, since it takes extra energy to sift through the CI failures.
  2. It slightly downgrades the public perception/reputation of Lightning, when people visit the repo just to check it out and see that CI is red.
  3. It leads to cascading failures. For example, the conda (3.7, 1.10) CI job has been failing consistently on everyone's PRs for a while. A few days ago it had only 4 test failures, but now it has 30 - https://github.com/PyTorchLightning/pytorch-lightning/pull/9118/checks?check_run_id=3570542595. Because a job with 4 failing tests and a job with 30 look identical in the checks on a PR, people may merge PRs that introduce new test failures.

Proposal

Make all CI jobs required for merging a PR. If a job that runs on a PR is "okay" to fail, why have it at all? I propose we make every CI job required and delete the ones we don't want to make required. This will also incentivize/force everyone to keep the CI green.
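For reference, "required" here concretely means adding the job names to the branch protection rule's required status checks. A minimal sketch of doing this through GitHub's REST API, assuming a token with admin rights on the repo (the job names listed are illustrative):

```python
# Mark specific CI job names as required status checks on master via
# GitHub's branch-protection REST API. Requires a token with admin access.
import os

import requests

OWNER, REPO, BRANCH = "PyTorchLightning", "pytorch-lightning", "master"
# Must match the check names shown on PRs, e.g. "conda (3.7, 1.10)".
REQUIRED_JOBS = ["conda (3.7, 1.10)", "pytest (ubuntu-20.04, 3.8)"]

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
    json={
        # "strict" would also require branches to be up to date before merging.
        "required_status_checks": {"strict": False, "contexts": REQUIRED_JOBS},
        "enforce_admins": True,
        "required_pull_request_reviews": None,
        "restrictions": None,
    },
)
resp.raise_for_status()
```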

@PyTorchLightning/core-contributors

carmocca commented 3 years ago

I don't think marking all as required is feasible given the number of tests we run and the overall process. Some reasons:

  * Flaky tests can appear out of the blue (dependency update, bad network, environment change, bad caches, ...) which would block the full pipeline. They should be avoided at all costs but sometimes they just happen.
  * GitHub doesn't allow restarting individual jobs.
  * Network errors are bound to happen if running enough jobs. Keep in mind that some tests download datasets so it's not just the initial request to run the job.

To me, the key is investing resources in improving everything around the CI. Some ideas:

The community feels helpless when things fail because there is no resource explaining how our CI is structured or what the goals of each job are. They rely on us entirely.

But at this stage, setting all jobs as required would just make things worse.

If you or anybody from the community is motivated to improve the efficiency of all this, please reach out ❤️

daniellepintz commented 3 years ago

@carmocca Going back to this, I must say I still believe the best thing to do is to go through all our CI jobs and either make each one required or delete it. Otherwise, I feel we will always be fighting a never-ending battle against the CI, because as long as people are able to merge PRs with failing jobs, they will (not out of malice, just by the nature of the system). Currently there are no real incentives for people to fix failing CI jobs.

To address your concerns:

> Flaky tests can appear out of the blue (dependency update, bad network, environment change, bad caches, ...) which would block the full pipeline. They should be avoided at all costs but sometimes they just happen.

True. But in this case here are the scenarios I see:

  1. There is a quick fix (as in the case of #9623), so no harm done, and the benefit is that it actually gets fixed quickly rather than causing hours or days of broken CI before someone gets to it.
  2. There is no quick fix, but at least the issue gets the attention it deserves and there are multiple motivated hands on deck trying to fix it. If we are unable to find a solution, there are workarounds, like temporarily disabling the job or overriding the job on a high-priority PR (see the sketch after this list).
  3. If this happens often for a certain test, it might point to some fundamental flaw in the test's design, and perhaps we should consider changing or deleting it. After all, our tests should not be inherently flaky.
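For scenarios 2 and 3, a minimal sketch of the two lightweight workarounds, assuming the pytest-rerunfailures plugin is installed (the test names and bodies are illustrative placeholders, not real Lightning tests):

```python
import random

import pytest


@pytest.mark.flaky(reruns=2, reruns_delay=5)
def test_transient_timing_issue():
    # Rerun up to twice before reporting a failure, so a one-off network
    # or timing hiccup doesn't turn the whole (required) job red.
    assert random.random() > 0.01  # stand-in for a rarely failing assertion


@pytest.mark.skip(reason="temporarily disabled while the root cause is fixed")
def test_known_broken_path():
    # Explicitly quarantined, rather than left failing on everyone's PRs.
    ...
```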

> GitHub doesn't allow restarting individual jobs.

True, but this is a pain point either way, and relatively minor.

> Network errors are bound to happen if running enough jobs. Keep in mind that some tests download datasets so it's not just the initial request to run the job.

Then we should be running fewer jobs. If in the current state we already know some CI jobs are bound to fail, we are essentially resigning ourselves to an eternally red CI, whether the jobs are required or not. We must fix this by deleting some jobs.
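That said, for any dataset-downloading jobs that do stay, the network dependency can at least be softened with caching plus retry-with-backoff. A minimal sketch; the helper and cache path are hypothetical, not Lightning's actual download code:

```python
import time
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "ci-datasets"  # illustrative location


def fetch_dataset(url: str, retries: int = 3) -> Path:
    """Download `url` once, retrying transient failures, and cache it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / url.rsplit("/", 1)[-1]
    if target.exists():  # cached by an earlier run: no network needed
        return target
    for attempt in range(retries):
        try:
            tmp = target.with_suffix(target.suffix + ".part")
            urllib.request.urlretrieve(url, tmp)
            tmp.rename(target)  # only expose fully downloaded files
            return target
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: surface the real error
            time.sleep(2 ** attempt)  # back off 1s, then 2s
```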

carmocca commented 3 years ago

You raise good points, Danielle. My proposal would be to keep things as they are right now, fix existing broken tests (you are on this already!), reduce flakiness, etc. Then, as our confidence in the CI increases, we can start marking things as required.

daniellepintz commented 3 years ago

Sounds great, I will continue working towards that!

carmocca commented 2 years ago

I think CI is in a much better state right now compared to back in September 🤞. Ideally we would mark everything as required, but there are still sources of flakiness around.

Closing this for now.

daniellepintz commented 2 years ago

Agree we are in a better state! It is so nice to see the green checkmarks on PRs! Although there are still a few red X's we have to fix...