bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.96k stars 4.02k forks source link

Remote cache: download error fails build #23033

Closed criemen closed 1 week ago

criemen commented 1 month ago

Description of the bug:

In my understanding, remote cache download errors ought to not fail the build, but rather be logged as warnings.

Today, I hit the following exception

[294 / 296] [Sched] Building XXX/cookbook/libcookbook.jar (2 source files); Downloading YYY/libutil-java7_INTERNAL_TRANSITION-hjar.jar; 44s
ERROR: XXX/cookbook/BUILD.bazel:3:20: Building XXX/cookbook/libcookbook.jar (2 source files) failed: IOException while prefetching for worker:

---8<---8<--- Exception details ---8<---8<---
com.google.devtools.build.lib.remote.common.BulkTransferException: Download of '/v1_Linux-6.5.0-1023-azure-x86_64-with-glibc2.35common/cas/db937f11623980da297e5bf38335e9697ad9f2c70f67b0a0d81f7af4930449bd' timed out. Received 0 bytes.
    at com.google.devtools.build.lib.remote.util.Utils.lambda$mergeBulkTransfer$4(Utils.java:656)
    at com.google.common.util.concurrent.CombinedFuture$AsyncCallableInterruptibleTask.runInterruptibly(CombinedFuture.java:165)
    at com.google.common.util.concurrent.CombinedFuture$AsyncCallableInterruptibleTask.runInterruptibly(CombinedFuture.java:153)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.CombinedFuture$CombinedFutureInterruptibleTask.execute(CombinedFuture.java:108)
    at com.google.common.util.concurrent.CombinedFuture.handleAllCompleted(CombinedFuture.java:65)
    at com.google.common.util.concurrent.AggregateFuture.processCompleted(AggregateFuture.java:301)
    at com.google.common.util.concurrent.AggregateFuture.decrementCountAndMaybeComplete(AggregateFuture.java:283)
    at com.google.common.util.concurrent.AggregateFuture.lambda$init$1(AggregateFuture.java:181)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.util.RxFutures$1.onError(RxFutures.java:221)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableFromSingle$CompletableFromSingleObserver.onError(CompletableFromSingle.java:41)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
    at com.google.devtools.build.lib.remote.util.AsyncTaskCache$1.onError(AsyncTaskCache.java:339)
    at com.google.devtools.build.lib.remote.util.AsyncTaskCache$Execution.onError(AsyncTaskCache.java:205)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle$ToSingle.onError(CompletableToSingle.java:73)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableUsing$UsingObserver.onError(CompletableUsing.java:165)
    at io.reactivex.rxjava3.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onError(CompletablePeek.java:95)
    at io.reactivex.rxjava3.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onError(CompletablePeek.java:95)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.tryOnError(CompletableCreate.java:91)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onError(CompletableCreate.java:77)
    at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.RemoteCache$3.onFailure(RemoteCache.java:381)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:105)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:850)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:216)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:192)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:144)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:105)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.http.HttpCacheClient.lambda$get$6(HttpCacheClient.java:563)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
    at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:110)
    at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:89)
    at com.google.devtools.build.lib.remote.http.AbstractHttpHandler.failAndResetUserPromise(AbstractHttpHandler.java:58)
    at com.google.devtools.build.lib.remote.http.AbstractHttpHandler.exceptionCaught(AbstractHttpHandler.java:116)
    at com.google.devtools.build.lib.remote.http.HttpDownloadHandler.exceptionCaught(HttpDownloadHandler.java:172)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)
    at io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:92)
    at io.netty.channel.CombinedChannelDuplexHandler$1.fireExceptionCaught(CombinedChannelDuplexHandler.java:145)
    at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
    at io.netty.channel.CombinedChannelDuplexHandler.exceptionCaught(CombinedChannelDuplexHandler.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
    at io.netty.handler.ssl.SslHandler.exceptionCaught(SslHandler.java:1115)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
    at com.google.devtools.build.lib.remote.http.IdleTimeoutHandler.channelIdle(IdleTimeoutHandler.java:43)
    at io.netty.handler.timeout.IdleStateHandler$AllIdleTimeoutTask.run(IdleStateHandler.java:576)
    at io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:475)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
    Suppressed: com.google.devtools.build.lib.remote.http.DownloadTimeoutException: Download of '/v1_Linux-6.5.0-1023-azure-x86_64-with-glibc2.35common/cas/db937f11623980da297e5bf38335e9697ad9f2c70f67b0a0d81f7af4930449bd' timed out. Received 0 bytes.
        ... 29 more
---8<---8<--- End of exception details ---8<---8<---
[295 / 296] checking cached actions
INFO: Elapsed time: 86.073s, Critical Path: 60.81s
INFO: 295 processes: 264 remote cache hit, 20 internal, 4 linux-sandbox, 7 worker.
ERROR: Build did NOT complete successfully
BAZEL FAILED

This indicates to me that when getting a timeout during prefetching data for workers, the build will fail hard, instead of the usual warning that's logged about the timeout, and bazel moving on with the build.

Getting those timeouts in the first place is problematic and something I need to investigate, but having them crash the build is not great.

We're using a HTTP cache (bazel-remote), bazel is configured with

common --remote_max_connections=300
common --credential_helper_timeout=30s
common --credential_helper_cache_duration=4m
common --remote_retries=2
common --experimental_remote_cache_async

(plus cache and credential helper specification, omitted here)

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux, Ubuntu 22.04

What is the output of bazel info release?

release 7.2.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

criemen commented 1 month ago

Reading the code a bit closer, I guess this is sort-of expected behavior: If we have an error connecting to the remote action cache, we just fall back to executing the action (in my case, locally), produce the outputs locally, and move on.

But what happens in this case is:

  1. Action A is found in the remote action cache, so it's not executed
  2. Action A is marked as completed (after all, it's cached)
  3. Action B depends on action A's output.
  4. Action B is not found in the Remote Cache (let's say, it just went away)
  5. Now we need to pull the output from action A from the remote cache, but it just went away
  6. As action A was completed long ago, we have no way of requesting to re-schedule it to produce its outputs locally

This is why there's a whole top-level loop around the entire bazel invocation using experimental_remote_cache_eviction_retries which retries on element-not-found errors. This mechanism wouldn't retry the DownloadTimeoutException I'm hitting here, as that's already taken care of by the in-downloader retry mechanism.

Assuming that the remote cache has a temporary/permanent outage, though, having a retry loop around the entire build invocation would be great, otherwise an outage of the remote cache will mean that all in-flight CI jobs that rely on the cache will fail with an error. The nature of a build cache should be imo that it can go away at any point in time without affecting builds, except that they now get slower.

coeuvre commented 1 month ago

Nice analysis! This is exactly what happened under the hood.

The nature of a build cache should be imo that it can go away at any point in time without affecting builds, except that they now get slower.

I agree with you. I think making --experimental_remote_cache_eviction_retries work with other remote cache error is a valid approach.