[DX Streamline] Establish flaky test eradication program

BigLep commented 5 months ago

Done Criteria

There is a metric that shows how much lotus is plagued by test flakiness. Eradication of flaky tests should have an impact on this metric.
It is self-service for anyone to find this metric.
It is self-service for anyone to find flaky tests and how flaky they are.
There is a public SOP/runbook on how a maintainer should handle when a flaky test is discovered.

Why Important

Flaky tests are insidious. We have them in our codebase, and it has two primary problems:

Slows down the velocity we can merge because judgement needs to be applied for whether it's safe to merge and we need to rely on those with admin permissions (e.g., maintainers) to merge.
It's compounding. Once we have some, it's easier for more to slip in, which makes it harder to dig out later.

We want to have the standard of "we only merge if CI is green", and that is much easier bar to hold if flakiness has been eradicated.

User/Customer

Maintainers and contributors.

Notes

This issue can be completed even before all flaky tests are eradicated. This issue is done when we have credible visibility and plan to get there.
It's acknowledged that handling flaky tests is not a new effort for Lotus development. This issue is more about building on existing efforts and getting more robust so that we win this battle.
Lotus has the "impact/test-flakiness" label for historically identified flaky tests.
Handling flaky tests is nothing unique to Lotus. There is lots of prior art and tooling. Some examples / discussion I'm aware of:
- go-libp2p: they have a score badge on their readme which links to deeper analysis at https://marcopolo.github.io/FlakyTests/
- Observable notebooks have been created concerning this topic: https://observablehq.com/@observablehq/integration-test-flakiness
- Github projects: https://github.com/F-Secure/flaky-tests-detection
- Articles on the topic:
  - https://engineering.fb.com/2020/12/10/developer-tools/probabilistic-flakiness/
  - https://dl.acm.org/doi/10.1145/3377813.3381370
- Dedicated service/tool: https://buildpulse.io/

BigLep commented 5 months ago

I'm pasting in some comments that had from Slack DMs:

Valuable comments from @marcopolo who lead this effort for go-libp2p:

Jotting a couple thoughts:

Observable notebook: A fork from https://observablehq.com/@observablehq/integration-test-flakiness
- The observable notebook requires users to load all the data from github on each view. The above source loads fast because it uses sample data.
- For more than a couple of requests, it requires a github access token secret. Secrets now require a paid plan from observable (this wasn't the case originally)
- It's not fine grain enough. It tells you which GitHub action workflow is flaky, but not which test is flaky. I didn't find it useful to know that go-test was flaky, I wanted to know that TestFoo was flaky.
FlakyTests Julia notebook:
- This is very much hacked together. I make no claim on the quality of the code :slightly_smiling_face:
- In contrast to the Observable notebook this setup gives me:
  - Viewers don't have to load data. It's ready for them instantly.
  - A background cron job updates the data via GitHub Actions.
  - I can track the actual tests that failed.

If you want to do something similar for lotus, I wouldn't recommend using the FlakyTests julia notebook. It's idiosyncratic and someone will spend more time understanding it than rewriting it. I know you didn't ask but here's what I'd recommend:

Try BuildPulse: https://buildpulse.io/ . I haven't used it, but it might be something that might work and get this done with no effort in the short term.
Building this yourself isn't hard either, here's roughly the steps:
1. Define what flakiness means:
  1. If a run fails in an earlier attempt, but succeeds in a later attempt it is flaky. A specific run id is per commit, a run id can have multiple attempts.
2. Download the logs of failed workflows with multiple attempts that have at least one successful attempt.
3. grep through the logs to find the failed test names and include OS information, test category, and any other relevant fields
4. Put this all in a sqlite DB.
5. If running this in a GitHub action, produce the sqlite DB as an artifact.
6. Now you have a clean version of the data.
  1. Visualizations are easy.
  2. So is interactive exploration.
  3. Even load it into observable: https://observablehq.com/@observablehq/sql-chart

rvagg commented 5 months ago

For more than a couple of requests, it requires a github access token secret. Secrets now require a paid plan from observable (this wasn't the case originally)

I don't believe this is the case, after getting tangled up not finding secrets I discovered that they were just moved to a new location. https://observablehq.com/team/@USERNAME/settings/secrets. So I got mine working without a paid plan.

MarcoPolo commented 5 months ago

Glad to be corrected, thanks!

rvagg commented 5 months ago

Slightly modified version of the Observable notebook to include some sorting. This is looking across all branches, so we end up with some non-representative failures (e.g. if a particular branch is doing something funky that impacts one of the tests only, e.g. I think fevm is failing on my frequently updated rvagg/niporep branch because it has a new actors bundle and the test needs to be tuned to it), but it's a lot more data than just looking at master so is quite interesting. This is all flakies >10%.

galargh commented 5 months ago

It's not fine grain enough. It tells you which GitHub action workflow is flaky, but not which test is flaky. I didn't find it useful to know that go-test was flaky, I wanted to know that TestFoo was flaky.

That's the great thing about the itest setup in lotus, each test gets their own job so they show up separately in the results by default. Otherwise, we'd have to dig in into the artifacts, parse reports, etc. It is not impossible or even that hard by any means, but if we don't have to do that, that's great!

To sum it all up, I guess the only missing piece is a periodic data aggregation so that we don't have to query GitHub API directly for all individual runs any time someone wants to inspect the flakiness. We could run it on schedule and store the results in the artifacts. We get 90 day retention tops here, but that should be enough to see improvement/deterioration over time. And if it's not, we can defer to a more long-term storage on S3 for example.

BigLep commented 5 months ago

Thanks all for the updates.

On the observable notebook side: @rvagg : are you able to share your modified notebook given https://observablehq.com/team/@USERNAME/settings/secrets ?

Concerning BuildPulse: I'll still connect with them this week to get a sense of how their offering compares.

rvagg commented 5 months ago

When I try to make it public:

To collaborate in a private notebook you need Pro I believe.

BigLep commented 5 months ago

To collaborate in a private notebook you need Pro I believe.

Oh. I was thinking you could make the notebook public but it wouldn't work for anyone else unless they have a secret with the same name in their account (that stores the GITHUB API key). Even if someone had to fork it, that seems fine.

(I know we're still exploring and not committed to the Observable path, but if we do go down it, maybe it makes sense to get a FilOz teamspace?)

BigLep commented 5 months ago

each test gets their own job so they show up separately in the results by default.

@galargh: thanks for educating me. Do any of these jobs have multiple in them or are they always one to one? I thought I had seen multiple PASSes in some jobs, but I didn't look too deeply across a very large sample. If we do have multiple tests in a job, I wonder if we'll still want the deeper analysis.

BigLep commented 5 months ago

I know we need to decide on some next steps here. We've had the observable notebook and BuildPulse ideas. I also saw a comment from IPDX about extending GitHub Monitoring to cover test flakiness using the “passing rerun heuristic”.

I can't give more thought on this at the moment. I welcome IPDX to drive here with a decision-table or pros/cons list.

I do think its worthwhile for us to try BuildPulse. My meeting with their CEO was good and I like what I saw. It's self-service for us to get started, and the first week is free while we're exploring. The only requirements are that we do junit test output (I don't know if we are but I assume it's easy enough to do?) and wire in a GitHub Action.

@galargh : i think it's fine to spend the time to do these things even if we don't go with BuildPulse so folks can try the tool firsthand and in case it helps guide any alternative approach we take.

Thanks!

BigLep commented 4 months ago

For anyone watching, FYI that there is work in progress on this.

There is the BuildPulse PoC (being enabled by https://github.com/filecoin-project/lotus/pull/12225) and IPDX also created some reporting in Grafana at https://ipdx.grafana.net/dashboard/snapshot/TfD5JfybgB0odLmipdZniNgayqr1zv3c (screenshot below)

ipdx grafana net_dashboard_snapshot_TfD5JfybgB0odLmipdZniNgayqr1zv3c_var-interval=1d var-org=$__all var-repo=$__all var-branch= var-workflow= var-job= var-label= var-granularity= var-limit=

BigLep commented 4 months ago

@galargh : concerning the Grafana dashboard...

Is there missing text after "( per )"?
What causes a job to be donated as flaky?
I assume within a GH Job, there may be multiple tests functions. I assume with this approach it isn't possible for us to see which specific test is causing the problem?

(We don't have to invest time here on this yet. Once we compare with the BuildPulse PoC, we can see where we want to double-down.)

BigLep commented 2 months ago

2024-09-04 status update: we're still looking at BuildPulse:https://app.buildpulse.io/@filecoin-project/lotus

We've been seeing these issues:

1) Suite and Class are set as "command-line-arguments" Example: https://app.buildpulse.io/@filecoin-project/lotus/tests/22990925769

@biglep investigation: this appears to be the case for all itests when looking at the test.yml workflow (example). All of the "itest" jobs output with "command-line-arguments". According to ChatGPT:

When you see “command-line-arguments” in the output of gotestsum, it usually indicates that the Go test command is being run in a package that doesn’t have a defined package name, or it’s being run in the root of a module where multiple packages exist.

Go assigns “command-line-arguments” as a placeholder name when the test files are compiled directly from the command line without specifying a package. This can also happen if you’re running tests across multiple packages or in a directory that isn’t a Go package itself.

To avoid seeing “command-line-arguments,” make sure you are running gotestsum from within a directory that contains a Go package, or specify the package path explicitly when running the command.

I assume our custom setup for running gotestsum for itests is hitting one of these conditions.

2) Number of BuildPulse failure looks low

@biglep investigation:

The data actually matches for me when looking at the last two weeks in August:

https://app.buildpulse.io/@filecoin-project/lotus/builds?q=&show_type=builds&filter=failed&starting_at=%222024-08-18T00%3A00%3A00.000Z%22&ending_at=%222024-09-01T00%3A00%3A00.000Z%22 corroborates with https://github.com/filecoin-project/lotus/actions/workflows/test.yml?query=is%3Afailure

The case where there is a failed test.yml that doesn't show up in buildpulse is when an earlier job in the workflow doesn't complete (e.g., "cache dependencies" - example)

I think the next step is figure out the "command-line-arguments" issue.

MarcoPolo commented 2 months ago

Sharing a quick update from go-libp2p. We're saving the outputs of go test into a sqlite file, as well as rerunning failures within the same to measure flakiness and not-fail the overall job. We're using the sqlite data to provide the failure logs upfront, and do some simple analysis via a public observable notebook (example). We also have a daily job that aggregates all these sqlite tables from runs on the master and outputs some analysis on FlakyTestsV2, but this is a WIP.

So far it has been a great quality of life improvement because:

CI now has a much higher signal:noise ratio.
I'm not manually rerunning a job because a single test failed.
The logs for a failure are much easier to find.
Having all the failure logs across workflows in one place is helpful.

fwiw, I think this approach should be pretty easy to replicate. There's very little (if any?) go-libp2p specific code there.

See https://github.com/libp2p/go-libp2p/pull/2923 for more details.

rvagg commented 2 months ago

Thanks @MarcoPolo, very interesting workflow! Why sqlite though? Wouldn't it be easier to sling some json around for this? Is it just because of the complexity of your flaky query?

rvagg commented 2 months ago

regarding command-line-arguments, it comes from the fact that we point directly to a _test.go file, easily replicated if you run one of the itests like we do:

$ go test ./itests/eth_transactions_test.go 
ok      command-line-arguments  29.371s

I just can't find a way around this (yet). Even specifying an -o doesn't help because the problem is in package name inference I think, which it's not doing properly.

MarcoPolo commented 2 months ago

Why sqlite though? Wouldn't it be easier to sling some json around for this? Is it just because of the complexity of your flaky query?

Maybe a personal preference. I prefer a SQL query for this as it’s declarative and probably faster than the equivalent JS. But of course, json would work just fine.

rvagg commented 2 months ago

$ go test ./itests/. -run "$(git grep -E '^func Test[^(]+\(\w+ \*testing\.T\)' itests/eth_transactions_test.go | sed -E 's/^.*func (Test[^(]+)\(.*/\1/' | awk '{printf "|^%s", $0}' | sed 's/^|//')"
ok      github.com/filecoin-project/lotus/itests        28.747s

That's the best I can come up with so far for showing the package name properly and avoiding command-line-arguments. i.e. instead of running the test file directly, we run the tests in the package but restrict execution to a regex made up of all of the tests defined in that file.

Two downsides: we'd have to be sure to have unique test function names across the files (and remember this into the future) and this is a complicated bit of command-line hackery just to make it show up nicer in BuildPulse and I don't think that's a good enough justification to increase the complexity of our CI pipeline.

rvagg commented 2 months ago

Modifying the .xml file directly before uploading https://github.com/filecoin-project/lotus/pull/12434

filecoin-project / lotus