bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

CacheNotFoundException when resuming build after remote cache evicted objects #19348

Open sluongng opened 11 months ago

sluongng commented 11 months ago

Description of the bug:

Our remote cache evicts old objects in an LRU fashion.

This means that if you have not built for a while, especially if your config is unique, it's likely that your cache object will get evicted.
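As a rough illustration of why a rarely built, unique configuration is the first to go, here is a minimal LRU sketch. The `LruCache` class and the digest names are hypothetical and are not how any real remote cache is implemented; the point is only that an entry nobody touches drifts to the front of the eviction queue.

```python
from collections import OrderedDict


class LruCache:
    """Toy blob store that evicts the least recently used entry (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def put(self, digest, blob):
        self._entries[digest] = blob
        self._entries.move_to_end(digest)  # mark as most recently used
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def get(self, digest):
        if digest not in self._entries:
            return None
        self._entries.move_to_end(digest)  # a read also refreshes recency
        return self._entries[digest]


cache = LruCache(capacity=2)
cache.put("my-unique-config-output", b"...")  # written once, then never touched
cache.put("other-1", b"...")
cache.put("other-2", b"...")  # exceeds capacity, so the oldest entry is evicted

evicted = cache.get("my-unique-config-output") is None
```

After the third `put`, the untouched entry is gone while the two newer ones remain.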

Today I resumed my M1 MacBook laptop after a weekend break and got a stack trace like this after my first build:

com.google.devtools.build.lib.remote.common.BulkTransferException: 3 errors during bulk transfer:
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: d0387e622e30ab61e39b1b91e54ea50f9915789dde7b950fafb0863db4a32ef8/17096
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 9718647251c8d479142d459416079ff5cd9f45031a47aa346d8a6e719e374ffa/28630
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 785e0ead607a37bd9a12179051e6efe53d7fb3eb05cc291e49ad6965ee2b613d/11504
        at com.google.devtools.build.lib.remote.util.RxUtils$BulkTransferExceptionCollector.onResult(RxUtils.java:91)
        ...
        at com.google.devtools.build.lib.remote.RemoteExecutionCache$1.onError(RemoteExecutionCache.java:232)
        ...
        at com.google.devtools.build.lib.remote.util.AsyncTaskCache$1.onError(AsyncTaskCache.java:340)
        at com.google.devtools.build.lib.remote.util.AsyncTaskCache$Execution.onError(AsyncTaskCache.java:206)
        ...
        at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
        ...
        at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
        ...
        at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
        ...
        at com.google.devtools.build.lib.remote.ByteStreamUploader$Writer.seekChunker(ByteStreamUploader.java:489)
        at com.google.devtools.build.lib.remote.ByteStreamUploader$Writer.run(ByteStreamUploader.java:442)
        ...
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
        Suppressed: com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: d0387e622e30ab61e39b1b91e54ea50f9915789dde7b950fafb0863db4a32ef8/17096
                at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:420)
                ...
                at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
                ...
                ... 5 more
        Suppressed: com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 9718647251c8d479142d459416079ff5cd9f45031a47aa346d8a6e719e374ffa/28630
                at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:420)
                ...
                at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
                ...
                ... 5 more
        Suppressed: com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 785e0ead607a37bd9a12179051e6efe53d7fb3eb05cc291e49ad6965ee2b613d/11504
                at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:420)
                ...
                at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
                ...
                ... 5 more

This was not fixed after several retries, but it seemed to be fixed after I went for lunch and came back to the laptop (no changes made). I assume this was caused by the idle shutdown of the Bazel JVM.

I think the correct expectation here is for Bazel to tell the remote cache / remote executor to re-run the action, but it seems like there could be edge cases that are not being handled properly.
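To make the expectation concrete, here is a minimal sketch of the behavior I have in mind. This is not Bazel's implementation: `CacheNotFound`, `build_with_retry`, `run_action`, and `regenerate` are hypothetical names, and the real logic would have to decide which upstream action to re-run.

```python
class CacheNotFound(Exception):
    """Hypothetical stand-in for a 'Missing digest' error from the remote cache."""

    def __init__(self, digest):
        super().__init__(f"Missing digest: {digest}")
        self.digest = digest


def build_with_retry(run_action, regenerate, max_attempts=3):
    """Run an action; if a blob it needs was evicted, regenerate that blob and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_action()
        except CacheNotFound as e:
            if attempt == max_attempts:
                raise  # give up and surface the original error
            regenerate(e.digest)  # e.g. re-run the action that produced the blob


# Toy setup: the remote side starts without the blob the action needs.
remote_blobs = set()


def run_action():
    if "protoc-digest" not in remote_blobs:
        raise CacheNotFound("protoc-digest")
    return "ok"


def regenerate(digest):
    remote_blobs.add(digest)


result = build_with_retry(run_action, regenerate)
```

In this toy run the first attempt fails with the missing digest, the blob is regenerated, and the second attempt succeeds instead of failing the build.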

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Not sure just yet.

We have "build without bytes" turned on with GRPC cache (no disk cache) and remote execution enabled.

Which operating system are you running Bazel on?

MacOS 13.5.1 darwin64

What is the output of bazel info release?

release 6.3.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Irrelevant, as the issue goes away if the JVM restarts.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

The problem is not exclusive to MacOS environment, however.

We have seen similar reports from our customers, who set up their Linux CI against our remote cache as well. Typically, it would be something like: do first build, wait several weeks, remote cache evicted, do second build -> similar failure.

Because of this, we have not been able to reproduce the situation reliably.

coeuvre commented 10 months ago

I agree the correct expectation is for Bazel to rerun the action. You mentioned you had several retries. Did they fail with the same missing digests?

sluongng commented 10 months ago

Yup they failed with the same missing digests.

sluongng commented 9 months ago

ERROR: /Users/sluongng/work/buildbuddy/buildbuddy/proto/BUILD:86:14: Generating Descriptor Set proto_library //proto:config_proto failed: (Exit 34): Missing digest: 80b9e5491f9626ee26828116d5e016689dafd368783ecadcb939456ba3d25cc5/5798416 for bazel-out/platform_linux-opt-exec-34F00540-ST-094ddd67efaf/bin/external/com_google_protobuf/protoc
com.google.devtools.build.lib.remote.common.BulkTransferException: Missing digest: 80b9e5491f9626ee26828116d5e016689dafd368783ecadcb939456ba3d25cc5/5798416 for bazel-out/platform_linux-opt-exec-34F00540-ST-094ddd67efaf/bin/external/com_google_protobuf/protoc
        at com.google.devtools.build.lib.remote.util.RxUtils$BulkTransferExceptionCollector.onResult(RxUtils.java:91)
        ...
        at com.google.devtools.build.lib.remote.RemoteExecutionCache$1.onError(RemoteExecutionCache.java:232)
        ...
        at com.google.devtools.build.lib.remote.util.AsyncTaskCache$1.onError(AsyncTaskCache.java:340)
        at com.google.devtools.build.lib.remote.util.AsyncTaskCache$Execution.onError(AsyncTaskCache.java:206)
        ...
        at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
        ...
        at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
        ...
        at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
        ...
        at com.google.devtools.build.lib.remote.ByteStreamUploader$Writer.seekChunker(ByteStreamUploader.java:509)
        at com.google.devtools.build.lib.remote.ByteStreamUploader$Writer.run(ByteStreamUploader.java:462)
        ...
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
        Suppressed: com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 80b9e5491f9626ee26828116d5e016689dafd368783ecadcb939456ba3d25cc5/5798416 for bazel-out/platform_linux-opt-exec-34F00540-ST-094ddd67efaf/bin/external/com_google_protobuf/protoc
                at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:436)
                ...
                at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
                ...
                ... 5 more

Facing this issue again today on Bazel 6.4.0rc1. The exception is BulkTransferException this time and the stack trace is slightly different.

Issue goes away on immediate retry 🤔

coeuvre commented 9 months ago

Were there any automatic retries?

sluongng commented 9 months ago

No auto retry for me.

coeuvre commented 9 months ago

It looks like in this corner case Bazel wasn't able to detect the cache eviction error and retry. It is NOT that Bazel wasn't able to clear the stale state.

ByteStreamUploader looks suspicious in the stack trace. It seems like the scenario was that Bazel was trying to upload an input to CAS for remote execution (because it was evicted).
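A rough sketch of that scenario, under loud assumptions: the function and variable names below are hypothetical, not Bazel internals, and the sketch assumes that with "build without bytes" an intermediate output such as `protoc` may exist only in the remote cache and not on local disk. If the CAS has evicted such a blob, the uploader has nowhere to read the bytes from.

```python
class MissingDigest(Exception):
    """Hypothetical stand-in for CacheNotFoundException."""


def upload_missing_inputs(inputs, remote_cas, local_files):
    """Upload every input blob the CAS lacks.

    inputs: dict mapping digest -> path of the input file.
    Raises MissingDigest when the bytes are neither in the CAS nor on local disk.
    """
    # Ask which blobs the CAS is missing (similar in spirit to a FindMissingBlobs query).
    missing = [digest for digest in inputs if digest not in remote_cas]
    for digest in missing:
        path = inputs[digest]
        if path not in local_files:
            # Assumed build-without-bytes case: the output was never downloaded,
            # and the only remote copy has been evicted.
            raise MissingDigest(f"Missing digest: {digest} for {path}")
        remote_cas[digest] = local_files[path]


remote_cas = {}  # everything has been evicted
local_files = {"src/main.cc": b"int main() {}"}  # source files do exist locally
inputs = {"d1": "src/main.cc", "d2": "bazel-out/bin/protoc"}

try:
    upload_missing_inputs(inputs, remote_cas, local_files)
    error = None
except MissingDigest as e:
    error = str(e)
```

Here the source file is re-uploaded without trouble, while the remote-only output fails with a missing-digest error. That would be the point at which the eviction needs to be detected and the producing action re-run.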

dieortin commented 6 months ago

I'm running into the same problem with Bazel 6.4.0.

iancha1992 commented 6 months ago

cc: @coeuvre