Coverage reports are not built reliably

UebelAndre commented 1 year ago

Description of the bug:

https://github.com/bazelbuild/rules_rust/issues/2079 shows that coverage reports are not consistently built, even when `--experimental_fetch_all_coverage_outputs` is used.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately I don't have a reliable repro, but the main pipeline (post-merge) fails regularly. My suspicion is that it happens more often on builds that have full cache hits.

Which operating system are you running Bazel on?

Linux, MacOS

What is the output of `bazel info release`?

6.3.0

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

comius commented 1 year ago

@c-mita please triage

c-mita commented 1 year ago

> Unfortunately I don't have a reliable repro, but the main pipeline (post-merge) fails regularly. My suspicion is that it happens more often on builds that have full cache hits.

Does this mean the failure is occurring before the combined-report generation, or that the combined report isn't being generated?

UebelAndre commented 1 year ago

I don't know when the failure occurs. All I know is that in rules_rust we started asserting that coverage data is actually present in bazel-out/_coverage/_coverage_report.dat, and that assertion has become a fairly common flake in CI: https://buildkite.com/bazel/rules-rust-rustlang/builds/9230#018a28b1-eeb1-4d91-86da-c7e896455ec4

It feels like this happens more often when rebasing a PR, which I suspect means the build was 100% cached. Perhaps that's contributing to the issue? But I don't think there's enough instrumentation here for me to figure out where in the coverage reporting something is failing.
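For readers who want to reproduce the check, here is a minimal sketch of an assertion of that kind. It is not the actual rules_rust CI check; the function name `lcov_source_files` is hypothetical, and it assumes the combined report is a standard LCOV tracefile in which each covered source file is introduced by an `SF:` record.

```python
from pathlib import Path


def lcov_source_files(report_path):
    """Return the source files named by SF: records in an LCOV tracefile.

    Raises AssertionError if the report is missing or has no SF: records,
    which is the "report was not really built" case discussed here.
    """
    path = Path(report_path)
    assert path.is_file(), f"coverage report not found: {path}"

    # Each covered source file starts with a line of the form "SF:<path>".
    sources = [
        line[len("SF:"):].strip()
        for line in path.read_text().splitlines()
        if line.startswith("SF:")
    ]
    assert sources, f"coverage report has no SF: records: {path}"
    return sources


# Hypothetical usage, from the workspace root after a coverage run:
# lcov_source_files("bazel-out/_coverage/_coverage_report.dat")
```

Distinguishing "file missing" from "file present but empty" in the failure message may help narrow down where the flake comes from.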

UebelAndre commented 1 year ago

Not that it's a unique occurrence but https://github.com/bazelbuild/rules_rust/pull/2137#issuecomment-1696561287 shows this is happening again:

https://buildkite.com/bazel/rules-rust-rustlang/builds/9279#018a3e70-c877-4296-ae26-7bc1c4986fc5

I would say a good repro is to open a PR to rules_rust with some trivial change and run `git commit --amend --date=now` to force CI to rerun; eventually you will see the failure.
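As a rough sketch, the re-trigger step could be scripted as below. Only `git commit --amend --date=now` comes from this thread; the `--no-edit` flag, the force-push step, the default remote and branch names, and the function names are assumptions added for illustration. The script defaults to a dry run that only prints the commands.

```python
import subprocess


def retrigger_commands(remote="origin", branch="my-trivial-change"):
    """Build the git commands that rewrite the tip commit and push it again."""
    return [
        # Amending with a fresh date gives the commit a new hash.
        ["git", "commit", "--amend", "--no-edit", "--date=now"],
        # Assumption: the PR branch must be force-pushed for CI to pick it up.
        ["git", "push", "--force-with-lease", remote, branch],
    ]


def retrigger_ci(remote="origin", branch="my-trivial-change", dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    commands = retrigger_commands(remote, branch)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Repeating this a number of times on the same PR is what eventually surfaces the flake, per the description above.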

c-mita commented 1 year ago

Looking at https://buildkite.com/bazel/rules-rust-rustlang/builds/9279

It looks like the combined report generator is being run (because it "very helpfully" outputs a lot of log messages describing what it's parsing):

(23:19:21) INFO: LCOV coverage report is located at /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/execroot/rules_rust/bazel-out/_coverage/_coverage_report.dat
 and execpath is bazel-out/_coverage/_coverage_report.dat
(23:19:21) INFO: From Coverage report generation:
Aug 28, 2023 10:44:58 PM com.google.devtools.coverageoutputgenerator.Main getTracefiles
INFO: Found 322 tracefiles.
Aug 28, 2023 10:44:58 PM com.google.devtools.coverageoutputgenerator.Main parseFilesSequentially
INFO: Parsing file bazel-out/k8-fastbuild/testlogs/util/label/label_test/coverage.dat
Aug 28, 2023 10:44:58 PM com.google.devtools.coverageoutputgenerator.Main parseFilesSequentially
INFO: Parsing file bazel-out/k8-fastbuild/testlogs/test/unit/extra_rustc_flags/extra_rustc_flags_not_present_test/coverage.dat
Aug 28, 2023 10:44:58 PM com.google.devtools.coverageoutputgenerator.Main parseFilesSequentially
INFO: Parsing file bazel-out/k8-fastbuild/testlogs/test/unit/rustdoc/lib_with_build_script_test/coverage.dat

Unfortunately the generator only logs errors, not completion after parsing, and I don't see any error messages.

But it at least suggests that the tool that produces the final file is indeed being executed. What happens after that is less clear.
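One way to get slightly more out of the existing logs is to compare the number of tracefiles the generator announces with the number it reports parsing. The sketch below does that for log text in the format shown above. It assumes the generator emits one "Parsing file" line per tracefile, which the excerpt suggests but does not confirm, and the function name is hypothetical.

```python
import re

# Patterns match the INFO lines shown in the log excerpt above.
FOUND_RE = re.compile(r"INFO: Found (\d+) tracefiles\.")
PARSING_RE = re.compile(r"INFO: Parsing file (\S+)")


def summarize_generator_log(log_text):
    """Return (announced, parsed) for a coverage generator log.

    announced is the tracefile count the generator reported (None if absent);
    parsed is the list of tracefile paths it logged as being parsed.
    """
    announced = None
    parsed = []
    for line in log_text.splitlines():
        found = FOUND_RE.search(line)
        if found:
            announced = int(found.group(1))
            continue
        parsing = PARSING_RE.search(line)
        if parsing:
            parsed.append(parsing.group(1))
    return announced, parsed
```

If, on a full (untruncated) log, `len(parsed)` falls short of `announced`, that would hint that the generator stopped partway through rather than after parsing everything.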

UebelAndre commented 1 year ago

This is still a big problem for rules_rust. I'm sure if additional logging were to be added to the coverage mechanics then the flakes we regularly get in that repo could reveal the root cause.

cc @scentini for visibility

UebelAndre commented 8 months ago

@Pavank1992 I think more data can only be provided if changes are made to bazel coverage that provide more verbose logs.
