Build/CI improvements brainstorm tracker

mattklein123 commented 2 years ago

We are seeing a lot of test flakes again which is a huge hindrance for developer productivity. This issue is for brainstorming various things that we might be able to build that would help us reduce CI time and reduce the number of reruns required for getting a successful test run. Once we do some brainstorming on high level ideas we can break out individual issues if we have people to work on them.

Some initial ideas to seed discussion:

Have a UI that we can use to look at granular test data at the individual build/test level including test time and flake % (assuming we run with flake attempts > 1). The UI should allow for inspection of the overall history of test runs, and then allow drilling down and sorting so that we can look at the worst offenders and target them for fixing. Optimally, we would somehow also track when a test was added and last modified so that we can get an idea of whether a test has become more flaky over time. Google has a project called https://github.com/google/resultstoreui which we might be able to use for this, or we could also talk to EngFlow, BuildBuddy, etc. about potential SaaS solutions that they might be willing to work with us on as an OSS project.
Increase flaky attempts to a larger number than we have now?
Make sure flaky attempts are running on all builds that we do?
Have some way of tracking the timeout for each test as well as how close we are to reaching it? This could be on the overall run as well as on a per test basis. This is something that could be surfaced in the above UI.
Track most resource intensive tests in terms of RAM, CPU time, etc. so that we can make sure we are getting good bang for our testing buck. Look at trends over time.
Have the UI able to report the worst offenders automatically via email, Slack plugin, GH issue, etc. so that we can make sure that things are getting triaged and dealt with.

cc @envoyproxy/envoy-maintainers @keith for further thoughts.

jmarantz commented 2 years ago

in the spirit of brainstorming, I want to +1 the bullet in particular that evaluates intensive tests in terms of ram/cpu-time, however IMO that's orthogonal to "flaky". My personal feeling is that flaky tests are much worse than slow tests.

I have another suggestion I think not covered by ^ -- enumerate and triage the flaky tests to maintainers and folks that have expressed interest in becoming one, and have each of us try to root-cause the flakiness. My general suspicion is usually that we are making assumptions about real-time behavior and either need to drastically increase sleep-time constants or move to simulated time. Doing the latter was such a huge slog for integration tests that I kind of gave up on it, but that was before @antoniovicente made sim-time integration with integration test infra much better.

daixiang0 commented 2 years ago

in the spirit of brainstorming, I want to +1 the bullet in particular that evaluates intensive tests in terms of ram/cpu-time, however IMO that's orthogonal to "flaky". My personal feeling is that flaky tests are much worse than slow tests.

I have another suggestion I think not covered by ^ -- enumerate and triage the flaky tests to maintainers and folks that have expressed interest in becoming one, and have each of us try to root-cause the flakiness. My general suspicion is usually that we are making assumptions about real-time behavior and either need to drastically increase sleep-time constants or move to simulated time. Doing the latter was such a huge slog for integration tests that I kind of gave up on it, but that was before @antoniovicente made sim-time integration with integration test infra much better.

Agree, many flaky tests are related to time, they pass at local but fail in CI, add wait time is only a workaround.

I have another suggestion as https://github.com/envoyproxy/envoy/pull/20443#issuecomment-1074883531, it would reduce some resources as more and more PRs of contrib become.

zhxie commented 2 years ago

I would like to suggest only run unit tests with code changes in a PR, but I think it is difficult to determine which tests have been affected by.

wbpcode commented 2 years ago

I would like to suggest only run unit tests with code changes in a PR, but I think it is difficult to determine which tests have been affected by.

This sounds great. May be it possible by analyzing bazel pkg dependencies.

phlax commented 2 years ago

I think the UI idea is a good one. I have thought about this previously and vaguely came to the conclusion that any ui would need to provide better information than what we get already from azure (prob not v hard!)

Not quite sure the best way to raise/notify failures, but i do know that i often go through some ~systematic steps to fish out errors so anything we can do to surface that would be a great help. One such idea i had was eg posting fix/diffs from azure -> github PRs, as i think many contributors are unaware of these (~related #20459)

I have been working on speeding up the precheck part of CI with some success - there is still a way to go on this.

I also have some plans to rationalize/reduce some of the precheck CI which will still take some time but many of the related issues are now resolved (mostly by shifting the python code to the pytooling repo), so im expecting some wins there in the not too distant future

alyssawilk commented 2 years ago

I mean for triage, maintainer on call is supposed to be looking at flaky tests, creating tracking bugs, and making sure they're assigned. I don't think we've been particularly rigorous about that and I think it'd make a big difference (based on the fact that when I go through and fix up tests, CI is way happier for weeks-to-months)

Also virtually no flakes I've looked at would be fixed by simtime, so I think we've gotten the low hanging fruit there,.

keith commented 2 years ago

I would like to suggest only run unit tests with code changes in a PR, but I think it is difficult to determine which tests have been affected by.

We could definitely use something like https://github.com/Tinder/bazel-diff to generate this for us, running fewer tests, even if it was pretty conservative about selection, could probably help a lot. Although theoretically bazel caching should be kicking in here for many cases, I haven't looked to see if that's not happening as much as we'd expect though?

lizan commented 2 years ago

I would like to suggest only run unit tests with code changes in a PR, but I think it is difficult to determine which tests have been affected by.

We could definitely use something like https://github.com/Tinder/bazel-diff to generate this for us, running fewer tests, even if it was pretty conservative about selection, could probably help a lot. Although theoretically bazel caching should be kicking in here for many cases, I haven't looked to see if that's not happening as much as we'd expect though?

The RBE cache should effectively doing this for CI, but that makes sense for local build improvement.

alyssawilk commented 2 years ago

ah correction, apparently https://source.cloud.google.com/results/invocations/2313fcd9-f070-479d-9d32-040f51825208/targets is what we want, the git project is more of a wrapper around that.

So an example tensorflow PR is this https://github.com/tensorflow/tensorflow/pull/55256 simple invocation https://source.cloud.google.com/results/invocations/698489c1-cb7f-42f8-9d79-3d0229a2efdc/targets;collapsed=

lfpino commented 2 years ago

Side note on the Engflow UI: https://github.com/envoyproxy/envoy-mobile already uses Engflow's UI for this. If you invoke bazel with --runs_per_test, for instance, you can see the runs/flakiness in the UI (invocation, PR). We're also working on a few ways to improve flakiness analysis through our UI and would be happy to hear what the OSS community finds useful in this area.

Disclaimer: I work at Engflow

yanavlasov commented 2 years ago

For improving build times we should look into using fastbuild mode. It is maybe an easy saving of 15-20% in build time. Some of our builds are in release mode and some are in dbg mode. I've tried to change this but our build config is shared between presubmit and postsubmit jobs which we then use to create artifacts for docker images and it was not easy to just flip build modes.

phlax commented 2 years ago

surfacing a suggestion from @wbpcode - adding some logic to detect which files changed, and only triggering relevant bits of CI for docs-only changes would allow those PRs to go through faster, and relieve some resource pressures

alyssawilk commented 2 years ago

FWIW also have resultstore working. sample succes run and failed runs here

https://source.cloud.google.com/results/invocations/7b06830e-36ce-497e-94c1-770fe085f26a/targets https://source.cloud.google.com/results/invocations/02cf0584-eabd-442a-a6bd-64c037f946e3/targets

phlax commented 2 years ago

one thing i have noticed - it would seem Engflow have "remote persistent workers"

this pattern would be a huge benefit to us - would speed up bazel massively for RBE, simplify aspect/worker patterns, and by extension potentially speed up just about everything

it would also make a massive difference to the projects carbon footprint

daixiang0 commented 2 years ago

one thing i have noticed - it would seem Engflow have "remote persistent workers"

envoyproxy / envoy

Build/CI improvements brainstorm tracker #20467