bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.78k stars 4k forks source link

baseline_coverage.dat missing with BES + BwtB #17001

Open keith opened 1 year ago

keith commented 1 year ago

Description of the bug:

We're seeing a log like:

(13:23:40) WARNING: Uploading BEP referenced local files: /private/var/tmp/_bazel_iosci/6462d53b34f6666019a2973c03de6978/execroot/lyftios/bazel-out/darwin_arm64-fastbuild/testlogs/Modules/.../baseline_coverage.dat (No such file or directory)
java.io.FileNotFoundException: /private/var/tmp/_bazel_iosci/6462d53b34f6666019a2973c03de6978/execroot/lyftios/bazel-out/darwin_arm64-fastbuild/testlogs/Modules/.../baseline_coverage.dat (No such file or directory)
    at com.google.devtools.build.lib.unix.NativePosixFiles.stat(Native Method)
    at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:209)
    at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:199)
    at com.google.devtools.build.lib.vfs.PathTransformingDelegateFileSystem.stat(PathTransformingDelegateFileSystem.java:252)
    at com.google.devtools.build.lib.remote.RemoteActionFileSystem.stat(RemoteActionFileSystem.java:460)
    at com.google.devtools.build.lib.remote.RemoteActionFileSystem.getFileSize(RemoteActionFileSystem.java:392)
    at com.google.devtools.build.lib.vfs.Path.getFileSize(Path.java:559)
    at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:61)
    at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.readPathMetadata(ByteStreamBuildEventArtifactUploader.java:186)

For every single test target when we run them on CI with BES uploads and --remote_download_toplevel --experimental_remote_build_event_upload=minimal

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I'm not sure, it's hard to debug with cache hits vs not

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

release 6.0.0rc4

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

brentleyjones commented 1 year ago

@coeuvre @tjgq

coeuvre commented 1 year ago

Probably the test action hit the cache so some files are not downloaded. Might related https://github.com/bazelbuild/bazel/pull/16556.

Can you please provide an repro? Not exactly sure how to trigger this error.

thirtyseven commented 1 year ago

I am also getting this warning on Bazel 6.1.0, but it doesn't seem to matter what values I specify for --remote_download_toplevel or --experimental_remote_build_event_upload=minimal. What's odder is that after the build completes, the baseline_coverage.dat file that Bazel complains it couldn't find is present in the output tree.

UebelAndre commented 1 year ago

I'm seeing this spam in rules_rust builds on Bazel 6.2.1 for the following pull-request https://github.com/bazelbuild/rules_rust/pull/2001

@coeuvre does this provide you a sufficient repro?

https://buildkite.com/bazel/rules-rust-rustlang/builds/8715#0188aaec-43c1-4ad1-a9fe-62cdbf7d03e3

(14:48:09) WARNING: Uploading BEP referenced local files: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/unit/proc_macro/extern_flag_passed_when_compiling_macro_2018/baseline_coverage.dat (No such file or directory)
java.io.FileNotFoundException: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/unit/proc_macro/extern_flag_passed_when_compiling_macro_2018/baseline_coverage.dat (No such file or directory)
    at com.google.devtools.build.lib.unix.NativePosixFiles.stat(Native Method)
    at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:210)
    at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:200)
    at com.google.devtools.build.lib.unix.UnixFileSystem.getFileSize(UnixFileSystem.java:391)
    at com.google.devtools.build.lib.vfs.Path.getFileSize(Path.java:559)
    at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:61)
    at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.readPathMetadata(ByteStreamBuildEventArtifactUploader.java:194)
    at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$upload$7(ByteStreamBuildEventArtifactUploader.java:330)
    at io.reactivex.rxjava3.internal.operators.flowable.FlowableMap$MapSubscriber.onNext(FlowableMap.java:64)
    at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable$IteratorSubscription.fastPath(FlowableFromIterable.java:185)
    at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable$BaseRangeSubscription.request(FlowableFromIterable.java:129)
    at io.reactivex.rxjava3.internal.subscribers.BasicFuseableSubscriber.request(BasicFuseableSubscriber.java:153)
    at io.reactivex.rxjava3.internal.jdk8.FlowableCollectWithCollectorSingle$CollectorSingleObserver.onSubscribe(FlowableCollectWithCollectorSingle.java:102)
    at io.reactivex.rxjava3.internal.subscribers.BasicFuseableSubscriber.onSubscribe(BasicFuseableSubscriber.java:67)
    at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable.subscribe(FlowableFromIterable.java:69)
    at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable.subscribeActual(FlowableFromIterable.java:47)
    at io.reactivex.rxjava3.core.Flowable.subscribe(Flowable.java:15917)
    at io.reactivex.rxjava3.internal.operators.flowable.FlowableMap.subscribeActual(FlowableMap.java:38)
    at io.reactivex.rxjava3.core.Flowable.subscribe(Flowable.java:15917)
    at io.reactivex.rxjava3.internal.jdk8.FlowableCollectWithCollectorSingle.subscribeActual(FlowableCollectWithCollectorSingle.java:71)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleMap.subscribeActual(SingleMap.java:35)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleUsing.subscribeActual(SingleUsing.java:83)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleSubscribeOn$SubscribeOnObserver.run(SingleSubscribeOn.java:89)
    at io.reactivex.rxjava3.internal.schedulers.ScheduledDirectTask.call(ScheduledDirectTask.java:38)
    at io.reactivex.rxjava3.internal.schedulers.ScheduledDirectTask.call(ScheduledDirectTask.java:25)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
UebelAndre commented 1 year ago

@scentini do you know what would be needed to gain some traction here? These warnings are exploding logs in my projects.

coeuvre commented 1 year ago

Sorry for the delay. I will look into it.

coeuvre commented 1 year ago

I can reproduce this issue using rules_rust with Bazel 6.2.1, but can't reproduce it with Bazel@HEAD. The root cause is similar to the issue described in #17798. I will work on a fix for 6.3.0.

coeuvre commented 1 year ago

And this issue exists without using BwoB. It's just that the file baseline_coverage.dat was not generated at the time when the BEP event was emitted.

coeuvre commented 1 year ago

cc @c-mita in case you are already aware of a fix for this issue.

c-mita commented 1 year ago

I'm not aware of anything regarding this.

Does it complain about missing baseline_coverage.dat files for all transitive targets? Or just the top-level test targets?

coeuvre commented 1 year ago

Just the top-level targets. The missing baseline_coverage.dat files are generated afterwards. My suspicion is there are missing deps between the BaselineCoverageAction and CompletionFunction so that the TargetCompletedEvent was emitted without waiting for baseline_coverage.dat to be generated. I am running a bisection to look which commit fixed this.

coeuvre commented 1 year ago

Bisection points me to 30cf799f58128d2922d3af59a6a907c4d8115952 which forces skyframe to use FJP. And if I rerun the repro with flag --experimental_use_fork_join_pool using 6.2.1, guess what, the issue disappeared.

cc @meisterT

coeuvre commented 1 year ago

In conclusion, the issue is baseline_coverage.dat hasn't been generated when TargetCompleteEvent was emitted which references the file. Then the uploader tried to read it from local filesystem but failed. The baseline_coverage.dat was generated afterwards.

The root cause is:

  1. baseline_coverage.dat is a toplevel artifact and is generated by BaselineCoverageAction.
  2. TargetCompleteEvent is created by CompletionFunction which is executed after a test target had finished. The has a reference to baseline_coverage.dat but the function doesn't check whether the file has been generated.
  3. Since there is no connection between BaselineCoverageAction and CompletionFunction currently, there is no guarantee for the existence of baseline_coverage.dat when TargetCompleteEvent is emitted.

The fact that the issue cannot be reproduced at HEAD is because of flag --experimental_use_fork_join_pool. I guess FJP silently changed how skyframe schedule actions and accidentally make all the BaselineCoverageAction run before CompletionFunction.

To mitigate the issue for 6.x users, we can simply cherry-pick commit 30cf799f58128d2922d3af59a6a907c4d8115952 into 6.3.0. A proper fix requires us somehow ask CompletionFunction wait for the referenced baseline_coverage.dat to be generated before emit the TargetCompleteEvent.