Open sluongng opened 11 months ago
I agree the correct expectation is for Bazel to rerun the action. You mentioned you had several retries, did they fail with the same missing digests?
Yup they failed with the same missing digests.
ERROR: /Users/sluongng/work/buildbuddy/buildbuddy/proto/BUILD:86:14: Generating Descriptor Set proto_library //proto:config_proto failed: (Exit 34): Missing digest: 80b9e5491f9626ee26828116d5e016689dafd368783ecadcb939456ba3d25cc5/5798416 for bazel-out/platform_linux-opt-exec-34F00540-ST-094ddd67efaf/bin/external/com_google_protobuf/protoc
com.google.devtools.build.lib.remote.common.BulkTransferException: Missing digest: 80b9e5491f9626ee26828116d5e016689dafd368783ecadcb939456ba3d25cc5/5798416 for bazel-out/platform_linux-opt-exec-34F00540-ST-094ddd67efaf/bin/external/com_google_protobuf/protoc
at com.google.devtools.build.lib.remote.util.RxUtils$BulkTransferExceptionCollector.onResult(RxUtils.java:91)
...
at com.google.devtools.build.lib.remote.RemoteExecutionCache$1.onError(RemoteExecutionCache.java:232)
...
at com.google.devtools.build.lib.remote.util.AsyncTaskCache$1.onError(AsyncTaskCache.java:340)
at com.google.devtools.build.lib.remote.util.AsyncTaskCache$Execution.onError(AsyncTaskCache.java:206)
...
at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
...
at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
...
at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
...
at com.google.devtools.build.lib.remote.ByteStreamUploader$Writer.seekChunker(ByteStreamUploader.java:509)
at com.google.devtools.build.lib.remote.ByteStreamUploader$Writer.run(ByteStreamUploader.java:462)
...
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Suppressed: com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 80b9e5491f9626ee26828116d5e016689dafd368783ecadcb939456ba3d25cc5/5798416 for bazel-out/platform_linux-opt-exec-34F00540-ST-094ddd67efaf/bin/external/com_google_protobuf/protoc
at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:436)
...
at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
...
... 5 more
Facing this issue again today on Bazel 6.4.0rc1.
The exception is BulkTransferException this time and the stack trace is slightly different.
Issue goes away on immediate retry 🤔
Were there any automatic retries?
No auto retry for me.
It looks like in this corner case Bazel wasn't able to detect the cache eviction error and retry, rather than Bazel being unable to clear the stale state.
ByteStreamUploader looks suspicious in the stack trace. It seems like the scenario was that Bazel was trying to upload an input to the CAS for remote execution (because it had been evicted).
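To make that reading of the stack trace concrete, here is a toy Python model. It is not Bazel's code. It assumes the client uploads whatever inputs the remote CAS lacks and that the content for one of them was never downloaded locally, which is what "build without bytes" would imply. The names `CacheNotFound`, `upload_missing_inputs`, `local_files` and `remote_cas` are all hypothetical.

```python
class CacheNotFound(Exception):
    """Hypothetical stand-in for the 'Missing digest' error in the trace."""


def upload_missing_inputs(inputs, local_files, remote_cas):
    """Upload every input the remote CAS lacks, reading content locally."""
    missing = [digest for digest in inputs if digest not in remote_cas]
    for digest in missing:
        if digest not in local_files:
            # Assumption: the blob was only ever stored remotely, so after
            # eviction there is no copy left to upload.
            raise CacheNotFound(f"Missing digest: {digest}")
        remote_cas[digest] = local_files[digest]
    return missing


local_files = {"config.proto": b"syntax = ..."}  # source file on disk
remote_cas = {}                                  # remote CAS after eviction
try:
    upload_missing_inputs(["config.proto", "protoc"], local_files, remote_cas)
    error = None
except CacheNotFound as exc:
    error = str(exc)
```

In this model the source file uploads fine, while the evicted tool binary cannot be re-uploaded because nothing holds its bytes any more.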
I'm running into the same problem with Bazel 6.4.0.
cc: @coeuvre
Description of the bug:
Our remote cache evicts old objects in an LRU fashion.
This means that if you have not built for a while, especially if your config is unique, it's likely that your cache object will get evicted.
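As a rough illustration of the eviction behaviour described above, here is a toy LRU blob store in Python. It is only a sketch: the class name `LruBlobStore`, the entry-count capacity and the digest strings are made up for the example and do not reflect how our cache is actually implemented.

```python
from collections import OrderedDict


class LruBlobStore:
    """Toy content-addressed store that evicts the least recently used blob."""

    def __init__(self, capacity):
        self.capacity = capacity  # max number of blobs (hypothetical unit)
        self._blobs = OrderedDict()

    def put(self, digest, data):
        self._blobs[digest] = data
        self._blobs.move_to_end(digest)  # mark as most recently used
        while len(self._blobs) > self.capacity:
            self._blobs.popitem(last=False)  # drop the least recently used

    def get(self, digest):
        if digest not in self._blobs:
            return None  # evicted or never stored
        self._blobs.move_to_end(digest)
        return self._blobs[digest]


store = LruBlobStore(capacity=2)
store.put("protoc", b"tool binary")  # written by the first build
store.put("other-1", b"...")         # unrelated traffic while idle
store.put("other-2", b"...")         # pushes "protoc" out of the store
protoc_after_idle = store.get("protoc")
```

After the idle period the lookup for "protoc" comes back empty, even though the client that built it may still believe the blob exists.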
Today I resumed my M1 MacBook laptop after a weekend break and got a stack trace like this after my first build.
Retrying several times did not fix it, but the problem went away after I went for lunch and came back to the laptop (no changes made). I assume the recovery was caused by the idle shutdown of the Bazel JVM.
I think the correct expectation here is for Bazel to tell the remote cache / remote executor to re-run the action, but it seems like there could be edge cases that are not being handled properly.
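The expectation above could be sketched roughly as follows. This is a minimal Python model of "re-run the producing action when its output has been evicted", not Bazel's implementation. `MissingDigest`, `fetch_or_rebuild`, `rerun_action` and the `max_attempts` parameter are hypothetical names chosen for the example.

```python
class MissingDigest(Exception):
    """Raised when the cache no longer holds a blob (hypothetical name)."""


def fetch_or_rebuild(digest, cache, rebuild, max_attempts=2):
    """Fetch a blob; if it was evicted, re-run its action and try again."""
    for attempt in range(max_attempts):
        if digest in cache:
            return cache[digest]
        if attempt + 1 == max_attempts:
            break  # out of attempts, give up below
        # The blob is gone: regenerate it instead of failing the build.
        cache[digest] = rebuild(digest)
    raise MissingDigest(digest)


cache = {}   # remote cache after eviction: the tool binary is gone
reruns = []  # records which actions had to be re-run


def rerun_action(digest):
    reruns.append(digest)
    return b"rebuilt " + digest.encode()


result = fetch_or_rebuild("protoc", cache, rerun_action)
```

The point of the sketch is that a missing digest triggers one re-run of the action that produces it, rather than surfacing as a build failure.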
Which category does this issue belong to?
Remote Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Not sure just yet.
We have "build without bytes" turned on with a gRPC cache (no disk cache) and remote execution enabled.
Which operating system are you running Bazel on?
macOS 13.5.1 darwin64
What is the output of bazel info release?
release 6.3.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
Irrelevant, as the issue goes away if the JVM restarts.
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
The problem is not exclusive to macOS environments, however.
We have seen similar reports from our customers, who set up their Linux CI against our remote cache as well. Typically, it would be something like: do first build, wait several weeks, remote cache evicted, do second build -> similar failure.
Because of this, we have not been able to reproduce the situation reliably.