Open BigLep opened 5 months ago
I'm pasting in some comments I had from Slack DMs:
Valuable comments from @marcopolo, who led this effort for go-libp2p:
Jotting a couple thoughts:
Observable notebook: A fork from https://observablehq.com/@observablehq/integration-test-flakiness
FlakyTests Julia notebook:
If you want to do something similar for lotus, I wouldn't recommend using the FlakyTests Julia notebook. It's idiosyncratic, and someone will spend more time understanding it than rewriting it. I know you didn't ask, but here's what I'd recommend:
For more than a couple of requests, it requires a github access token secret. Secrets now require a paid plan from observable (this wasn't the case originally)
I don't believe this is the case. After getting tangled up not finding secrets, I discovered that they were just moved to a new location: https://observablehq.com/team/@USERNAME/settings/secrets. So I got mine working without a paid plan.
Glad to be corrected, thanks!
Slightly modified version of the Observable notebook to include some sorting. This is looking across all branches, so we end up with some non-representative failures (e.g. if a particular branch is doing something funky that impacts one of the tests only; for instance, I think fevm is failing on my frequently updated rvagg/niporep branch because it has a new actors bundle and the test needs to be tuned to it), but it's a lot more data than just looking at master, so it's quite interesting. This is all flakies >10%.
It's not fine-grained enough. It tells you which GitHub Actions workflow is flaky, but not which test is flaky. I didn't find it useful to know that go-test was flaky; I wanted to know that TestFoo was flaky.
That's the great thing about the itest setup in lotus: each test gets its own job, so they show up separately in the results by default. Otherwise, we'd have to dig into the artifacts, parse reports, etc. It's not impossible or even that hard by any means, but if we don't have to do that, that's great!
To sum it all up, I guess the only missing piece is a periodic data aggregation so that we don't have to query GitHub API directly for all individual runs any time someone wants to inspect the flakiness. We could run it on schedule and store the results in the artifacts. We get 90 day retention tops here, but that should be enough to see improvement/deterioration over time. And if it's not, we can defer to a more long-term storage on S3 for example.
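A scheduled aggregation job along those lines could be sketched roughly as follows. This is a hypothetical GitHub Actions workflow, not something that exists in the repo; the workflow name, schedule, and file names are all illustrative:

```yaml
name: aggregate-test-flakiness
on:
  schedule:
    - cron: '0 6 * * *'  # once a day

jobs:
  aggregate:
    runs-on: ubuntu-latest
    steps:
      # Pull recent runs of the test workflow from the GitHub API so we
      # don't have to query it ad hoc every time someone wants the data.
      - name: Fetch recent test.yml runs
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh api "repos/${GITHUB_REPOSITORY}/actions/workflows/test.yml/runs?per_page=100" > runs.json
      # Artifacts give us up to 90 days of retention; longer-term storage
      # (e.g. S3) could be added later if that's not enough.
      - name: Upload aggregated results
        uses: actions/upload-artifact@v4
        with:
          name: flakiness-data
          path: runs.json
```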
Thanks all for the updates.
On the observable notebook side: @rvagg : are you able to share your modified notebook given https://observablehq.com/team/@USERNAME/settings/secrets ?
Concerning BuildPulse: I'll still connect with them this week to get a sense of how their offering compares.
When I try to make it public:
To collaborate in a private notebook you need Pro I believe.
To collaborate in a private notebook you need Pro I believe.
Oh. I was thinking you could make the notebook public but it wouldn't work for anyone else unless they have a secret with the same name in their account (that stores the GITHUB API key). Even if someone had to fork it, that seems fine.
(I know we're still exploring and not committed to the Observable path, but if we do go down it, maybe it makes sense to get a FilOz teamspace?)
each test gets its own job, so they show up separately in the results by default.
@galargh: thanks for educating me. Do any of these jobs have multiple tests in them, or are they always one-to-one? I thought I had seen multiple PASSes in some jobs, but I didn't look too deeply across a very large sample. If we do have multiple tests in a job, I wonder if we'll still want the deeper analysis.
I know we need to decide on some next steps here. We've had the observable notebook and BuildPulse ideas. I also saw a comment from IPDX about extending GitHub Monitoring to cover test flakiness using the “passing rerun heuristic”.
I can't give more thought on this at the moment. I welcome IPDX to drive here with a decision-table or pros/cons list.
I do think it's worthwhile for us to try BuildPulse. My meeting with their CEO was good, and I like what I saw. It's self-service for us to get started, and the first week is free while we're exploring. The only requirements are that we produce JUnit test output (I don't know if we are, but I assume it's easy enough to do?) and wire in a GitHub Action.
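For reference, gotestsum (which comes up later in this thread) can emit a JUnit XML report directly, so the requirement is likely just a one-line change; the file name and package selection here are illustrative, and this assumes gotestsum is installed:

```shell
# Write a JUnit XML report alongside the normal test output.
# Everything after `--` is passed through to `go test`.
gotestsum --junitfile junit.xml -- ./itests/...
```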
@galargh: I think it's fine to spend the time to do these things even if we don't go with BuildPulse, so folks can try the tool firsthand and in case it helps guide any alternative approach we take.
Thanks!
For anyone watching: FYI, there is work in progress on this.
There is the BuildPulse PoC (being enabled by https://github.com/filecoin-project/lotus/pull/12225) and IPDX also created some reporting in Grafana at https://ipdx.grafana.net/dashboard/snapshot/TfD5JfybgB0odLmipdZniNgayqr1zv3c (screenshot below)
@galargh : concerning the Grafana dashboard...
(We don't have to invest time here on this yet. Once we compare with the BuildPulse PoC, we can see where we want to double-down.)
2024-09-04 status update: we're still looking at BuildPulse: https://app.buildpulse.io/@filecoin-project/lotus
We've been seeing these issues:
1) Suite and Class are set as "command-line-arguments" Example: https://app.buildpulse.io/@filecoin-project/lotus/tests/22990925769
@biglep investigation: this appears to be the case for all itests when looking at the test.yml workflow (example). All of the "itest" jobs output with "command-line-arguments". According to ChatGPT:
When you see “command-line-arguments” in the output of gotestsum, it usually indicates that the Go test command is being run in a package that doesn’t have a defined package name, or it’s being run in the root of a module where multiple packages exist.
Go assigns “command-line-arguments” as a placeholder name when the test files are compiled directly from the command line without specifying a package. This can also happen if you’re running tests across multiple packages or in a directory that isn’t a Go package itself.
To avoid seeing “command-line-arguments,” make sure you are running gotestsum from within a directory that contains a Go package, or specify the package path explicitly when running the command.
I assume our custom setup for running gotestsum for itests is hitting one of these conditions.
2) The number of BuildPulse failures looks low
@biglep investigation:
The data actually matches for me when looking at the last two weeks in August:
https://app.buildpulse.io/@filecoin-project/lotus/builds?q=&show_type=builds&filter=failed&starting_at=%222024-08-18T00%3A00%3A00.000Z%22&ending_at=%222024-09-01T00%3A00%3A00.000Z%22 corroborates with https://github.com/filecoin-project/lotus/actions/workflows/test.yml?query=is%3Afailure
The case where there is a failed test.yml that doesn't show up in BuildPulse is when an earlier job in the workflow doesn't complete (e.g., "cache dependencies" - example).
I think the next step is to figure out the "command-line-arguments" issue.
Sharing a quick update from go-libp2p. We're saving the outputs of go test into a sqlite file, as well as rerunning failures within the same job to measure flakiness without failing the overall job. We're using the sqlite data to provide the failure logs upfront, and to do some simple analysis via a public Observable notebook (example). We also have a daily job that aggregates all these sqlite tables from runs on master and outputs some analysis on FlakyTestsV2, but this is a WIP.
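As a rough illustration of that setup (the schema and field handling below are my guesses, not go-libp2p's actual code): `go test -json` emits one JSON event per line, and the terminal pass/fail events can be loaded into sqlite for later querying. Sample output is inlined here so the sketch is self-contained:

```python
import json
import sqlite3

# Two sample `go test -json` events; in the real pipeline these would be
# read from the test run's output stream.
sample_output = """\
{"Action":"pass","Package":"github.com/libp2p/go-libp2p/p2p","Test":"TestDial","Elapsed":1.2}
{"Action":"fail","Package":"github.com/libp2p/go-libp2p/p2p","Test":"TestHolePunch","Elapsed":30.1}
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (package TEXT, test TEXT, action TEXT, elapsed REAL)")

for line in sample_output.splitlines():
    ev = json.loads(line)
    # Keep only per-test terminal events (package-level events have no "Test").
    if ev.get("Test") and ev["Action"] in ("pass", "fail"):
        db.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            (ev["Package"], ev["Test"], ev["Action"], ev["Elapsed"]),
        )

rows = db.execute("SELECT test, action FROM results ORDER BY test").fetchall()
print(rows)  # → [('TestDial', 'pass'), ('TestHolePunch', 'fail')]
```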
So far it has been a great quality-of-life improvement because:
FWIW, I think this approach should be pretty easy to replicate. There's very little (if any?) go-libp2p-specific code there.
See https://github.com/libp2p/go-libp2p/pull/2923 for more details.
Thanks @MarcoPolo, very interesting workflow! Why sqlite though? Wouldn't it be easier to sling some json around for this? Is it just because of the complexity of your flaky query?
Regarding command-line-arguments, it comes from the fact that we point directly to a _test.go file; easily replicated if you run one of the itests like we do:

```shell
$ go test ./itests/eth_transactions_test.go
ok      command-line-arguments  29.371s
```
I just can't find a way around this (yet). Even specifying an -o doesn't help, because the problem is in package name inference, I think, which it's not doing properly.
Why sqlite though? Wouldn't it be easier to sling some json around for this? Is it just because of the complexity of your flaky query?
Maybe a personal preference. I prefer a SQL query for this as it’s declarative and probably faster than the equivalent JS. But of course, json would work just fine.
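To illustrate the point (the table layout and the definition of "flaky" here are hypothetical, not go-libp2p's actual schema), a declarative query over per-run results stays short even as the logic grows:

```python
import sqlite3

# A test is treated as flaky if it has both failed and passed at some point;
# the fail rate falls out of the same GROUP BY for free.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (run_id INTEGER, test TEXT, action TEXT)")
db.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        (1, "TestFoo", "fail"),
        (1, "TestFoo", "pass"),  # failed, then passed on rerun
        (1, "TestBar", "pass"),
        (2, "TestFoo", "pass"),
    ],
)

flaky = db.execute(
    """
    SELECT test,
           1.0 * SUM(action = 'fail') / COUNT(*) AS fail_rate
    FROM results
    GROUP BY test
    HAVING SUM(action = 'fail') > 0 AND SUM(action = 'pass') > 0
    """
).fetchall()
print(flaky)  # TestFoo is the only flaky test here
```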
```shell
$ go test ./itests/. -run "$(git grep -E '^func Test[^(]+\(\w+ \*testing\.T\)' itests/eth_transactions_test.go | sed -E 's/^.*func (Test[^(]+)\(.*/\1/' | awk '{printf "|^%s", $0}' | sed 's/^|//')"
ok      github.com/filecoin-project/lotus/itests        28.747s
```
That's the best I can come up with so far for showing the package name properly and avoiding command-line-arguments; i.e., instead of running the test file directly, we run the tests in the package but restrict execution to a regex made up of all of the tests defined in that file.
Two downsides: we'd have to be sure to have unique test function names across the files (and remember this in the future), and this is a complicated bit of command-line hackery just to make things show up nicer in BuildPulse; I don't think that's a good enough justification to increase the complexity of our CI pipeline.
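The uniqueness requirement mentioned above could at least be guarded by a cheap check; this is a sketch with file contents inlined for the demo (in CI you would read the actual itests/*_test.go files instead):

```python
import collections
import re

# Hypothetical sample files; TestFoo is deliberately duplicated.
files = {
    "a_test.go": "func TestFoo(t *testing.T) {}\n",
    "b_test.go": "func TestFoo(t *testing.T) {}\nfunc TestBar(t *testing.T) {}\n",
}

# Count top-level Test function names across all files.
counts = collections.Counter()
for src in files.values():
    counts.update(re.findall(r"^func (Test\w+)\(", src, flags=re.M))

duplicates = [name for name, n in counts.items() if n > 1]
print(duplicates)  # → ['TestFoo']
```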
Modifying the .xml file directly before uploading https://github.com/filecoin-project/lotus/pull/12434
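That kind of post-processing could look roughly like this; a sketch only, the actual PR may do it differently, and the target package name here is an assumption:

```python
import xml.etree.ElementTree as ET

# Sample JUnit XML with the placeholder names Go emits for file-path runs.
raw = """<testsuites>
  <testsuite name="command-line-arguments" tests="1">
    <testcase classname="command-line-arguments" name="TestEthTransactions" time="29.371"/>
  </testsuite>
</testsuites>"""

PKG = "github.com/filecoin-project/lotus/itests"  # assumed real package path

root = ET.fromstring(raw)
# Rewrite the placeholder suite and class names before the report is uploaded.
for suite in root.iter("testsuite"):
    if suite.get("name") == "command-line-arguments":
        suite.set("name", PKG)
for case in root.iter("testcase"):
    if case.get("classname") == "command-line-arguments":
        case.set("classname", PKG)

fixed = ET.tostring(root, encoding="unicode")
print(fixed)
```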
Done Criteria
Why Important
Flaky tests are insidious. We have them in our codebase, and this causes two primary problems:
We want to hold the standard of "we only merge if CI is green", and that is a much easier bar to hold if flakiness has been eradicated.
User/Customer
Maintainers and contributors.
Notes