bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.14k stars 4.05k forks source link

RBE actions not properly canceled when BES upload times out #19496

Open minor-fixes opened 1 year ago

minor-fixes commented 1 year ago

Description of the bug:

In (at least some) cases, BES upload timing out does not seem to trigger proper cancellation of actions issued over RBE. In my case, I'm seeing:

FAILED: Build did NOT complete successfully
WARNING: Uploading BEP referenced local file: UNAVAILABLE: io exception
WARNING: Uploading BEP referenced local file /builder/home/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/command.profile.gz hash: "2aa3acf372336923a8ed69b9f0878489764153c0de2ac3039b10c92296bcdf2b"
size_bytes: 3461595
: UNAVAILABLE: io exception
FAILED: Build did NOT complete successfully
ERROR: The Build Event Protocol upload timed out. com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@2eac302a[status=PENDING, info=[delegate=[SettableFuture@6baebb9b[status=PENDING]]]]

and noticing that actions continue to run in Buildbarn despite the bazel client (and indeed the VM it was running on) go away. From a cursory look at the BES code, it would seem that abruptly exiting on BES timeout is the intent.

It's possible that only RBE actions that required a retry are not properly canceled - this seems to be a pattern, and I don't yet have a counterexample.

Since bazel is exiting semi-cleanly (i.e. not getting SIGKILL) it should cancel actions executing via RBE, in order to free up remote resources.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately, I'm unable to reproduce this issue outside of a nightly long-running build; any tips on gathering more debugging info would be greatly appreciated!

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

release 6.3.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

(private repository)

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No; no ideas from those on Slack yet either

Any other information, logs, or outputs that you want to share?

No response

tjgq commented 1 year ago

Running with --verbose_failures should reveal the strack trace associated with the UNAVAILABLE: io exception and might provide additional clues.

You also mention that RBE action retries might be a factor - how did you determine that an action was retried?

Regarding cancellation: my understanding is that, upon a build failure (which an abrupt exit qualifies as), Bazel cancels any pending gRPC requests to the remote execution server. But it's up to the remote side whether to actually abort the execution (aborting is not necessarily the best option; letting the action finish could still be advantageous to populate the remote cache for a subsequent build). So it would also help to know what's the expected behavior for Buildbarn in this situation.

minor-fixes commented 1 year ago

Running with --verbose_failures should reveal the strack trace associated with the UNAVAILABLE: io exception and might provide additional clues.

Thanks! I'll give this a shot and report back tomorrow

You also mention that RBE action retries might be a factor - how did you determine that an action was retried?

By looking at RPC traces - I can see that a particular action was started via Execute at time X, trace finishes at time Y, and a second Execute trace for the same target + mnemonic starts at time Y; at the same time, I can see from remote worker pod metrics that a worker pod went away at time Y, and can infer that the pod went away unexpectedly (OoM, cloud spot instance reclaim, etc.), buildbarn reported the failure back to the bazel client, and bazel issued a retry (not conclusive admittedly; but this seems highly probable?)

it would also help to know what's the expected behavior for Buildbarn in this situation

From what I understand, Buildbarn will continue executing actions as long as a client that is executing it has an open session. Buildbarn will deduplicate identical actions, but that is not a factor here, as:

If no clients care about the action, Buildbarn will time out the action in relatively short time (currently 60s for us; the action will otherwise continue to run for 1+ hours)

Additionally, when I Ctrl-C an RBE build, typically I can see RPCs error with a deadline exceeded-type error; when this long-running build gets SIGINT and exits, I don't see cancellation reflected in RPC traces in this way.

We have previously run builds in an environment where on build timeout the VM is abruptly killed; this would cause actions to continue running on buildbarn, because the TCP sessions weren't properly closed (might be using the wrong words here) - I may be able to fix this with TCP keepalive configuration on buildbarn, but it seems that maybe bazel is not shutting down connections as expected?

minor-fixes commented 1 year ago

Here's the full stacktrace:

FAILED: Build did NOT complete successfully
WARNING: Uploading BEP referenced local file: UNAVAILABLE: io exception
java.io.IOException: findMissingBlobs(1) for bes-upload: UNAVAILABLE: io exception
    at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$findMissingDigests$1(GrpcCacheClient.java:222)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AggregateFuture.handleException(AggregateFuture.java:200)
    at com.google.common.util.concurrent.AggregateFuture.collectValueFromNonCancelledFuture(AggregateFuture.java:270)
    at com.google.common.util.concurrent.AggregateFuture.lambda$init$0(AggregateFuture.java:149)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
    at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
    at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
    at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
    at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
    at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.Status.asRuntimeException(Status.java:535)
    ... 14 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: buildbarn.local.enfabrica.net/127.0.0.1:8002
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
    at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
WARNING: Uploading BEP referenced local file /builder/home/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/command.profile.gz hash: "5d21225aa1501a67f7de355070ff27823546449a8a3cc3343ad4defc38d961e9"
size_bytes: 3319577
: UNAVAILABLE: io exception
java.io.IOException: Error while uploading artifact with digest '5d21225aa1501a67f7de355070ff27823546449a8a3cc3343ad4defc38d961e9/3319577'
    at com.google.devtools.build.lib.remote.ByteStreamUploader.lambda$uploadBlobAsync$0(ByteStreamUploader.java:171)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:124)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
    at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
    at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
    at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
    at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.Status.asRuntimeException(Status.java:535)
    ... 10 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: buildbarn.local.enfabrica.net/127.0.0.1:8002
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
    at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
FAILED: Build did NOT complete successfully
ERROR: The Build Event Protocol upload timed out. com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@327e51a7[status=PENDING, info=[delegate=[SettableFuture@777cc98c[status=PENDING]]]]

The portion findMissingBlobs(1) for bes-upload is strange to me - is it calling FindMissingBlobs on the RBE endpoint, or the BES endpoint?

coeuvre commented 1 year ago

It's calling FindMissingBlobs on RBE endpoint because it needs to upload BEP referenced artifacts to CAS.