Open minor-fixes opened 1 year ago
Running with --verbose_failures
should reveal the strack trace associated with the UNAVAILABLE: io exception
and might provide additional clues.
You also mention that RBE action retries might be a factor - how did you determine that an action was retried?
Regarding cancellation: my understanding is that, upon a build failure (which an abrupt exit qualifies as), Bazel cancels any pending gRPC requests to the remote execution server. But it's up to the remote side whether to actually abort the execution (aborting is not necessarily the best option; letting the action finish could still be advantageous to populate the remote cache for a subsequent build). So it would also help to know what's the expected behavior for Buildbarn in this situation.
Running with --verbose_failures should reveal the strack trace associated with the UNAVAILABLE: io exception and might provide additional clues.
Thanks! I'll give this a shot and report back tomorrow
You also mention that RBE action retries might be a factor - how did you determine that an action was retried?
By looking at RPC traces - I can see that a particular action was started via Execute
at time X, trace finishes at time Y, and a second Execute
trace for the same target + mnemonic starts at time Y; at the same time, I can see from remote worker pod metrics that a worker pod went away at time Y, and can infer that the pod went away unexpectedly (OoM, cloud spot instance reclaim, etc.), buildbarn reported the failure back to the bazel client, and bazel issued a retry (not conclusive admittedly; but this seems highly probable?)
it would also help to know what's the expected behavior for Buildbarn in this situation
From what I understand, Buildbarn will continue executing actions as long as a client that is executing it has an open session. Buildbarn will deduplicate identical actions, but that is not a factor here, as:
If no clients care about the action, Buildbarn will time out the action in relatively short time (currently 60s for us; the action will otherwise continue to run for 1+ hours)
Additionally, when I Ctrl-C an RBE build, typically I can see RPCs error with a deadline exceeded
-type error; when this long-running build gets SIGINT and exits, I don't see cancellation reflected in RPC traces in this way.
We have previously run builds in an environment where on build timeout the VM is abruptly killed; this would cause actions to continue running on buildbarn, because the TCP sessions weren't properly closed (might be using the wrong words here) - I may be able to fix this with TCP keepalive configuration on buildbarn, but it seems that maybe bazel is not shutting down connections as expected?
Here's the full stacktrace:
FAILED: Build did NOT complete successfully
WARNING: Uploading BEP referenced local file: UNAVAILABLE: io exception
java.io.IOException: findMissingBlobs(1) for bes-upload: UNAVAILABLE: io exception
at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$findMissingDigests$1(GrpcCacheClient.java:222)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.AggregateFuture.handleException(AggregateFuture.java:200)
at com.google.common.util.concurrent.AggregateFuture.collectValueFromNonCancelledFuture(AggregateFuture.java:270)
at com.google.common.util.concurrent.AggregateFuture.lambda$init$0(AggregateFuture.java:149)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:535)
... 14 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: buildbarn.local.enfabrica.net/127.0.0.1:8002
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
at io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
WARNING: Uploading BEP referenced local file /builder/home/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/command.profile.gz hash: "5d21225aa1501a67f7de355070ff27823546449a8a3cc3343ad4defc38d961e9"
size_bytes: 3319577
: UNAVAILABLE: io exception
java.io.IOException: Error while uploading artifact with digest '5d21225aa1501a67f7de355070ff27823546449a8a3cc3343ad4defc38d961e9/3319577'
at com.google.devtools.build.lib.remote.ByteStreamUploader.lambda$uploadBlobAsync$0(ByteStreamUploader.java:171)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:124)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:535)
... 10 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: buildbarn.local.enfabrica.net/127.0.0.1:8002
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
at io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
FAILED: Build did NOT complete successfully
ERROR: The Build Event Protocol upload timed out. com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@327e51a7[status=PENDING, info=[delegate=[SettableFuture@777cc98c[status=PENDING]]]]
The portion findMissingBlobs(1) for bes-upload
is strange to me - is it calling FindMissingBlobs
on the RBE endpoint, or the BES endpoint?
It's calling FindMissingBlobs on RBE endpoint because it needs to upload BEP referenced artifacts to CAS.
Description of the bug:
In (at least some) cases, BES upload timing out does not seem to trigger proper cancellation of actions issued over RBE. In my case, I'm seeing:
and noticing that actions continue to run in Buildbarn despite the bazel client (and indeed the VM it was running on) go away. From a cursory look at the BES code, it would seem that abruptly exiting on BES timeout is the intent.
It's possible that only RBE actions that required a retry are not properly canceled - this seems to be a pattern, and I don't yet have a counterexample.
Since bazel is exiting semi-cleanly (i.e. not getting SIGKILL) it should cancel actions executing via RBE, in order to free up remote resources.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Unfortunately, I'm unable to reproduce this issue outside of a nightly long-running build; any tips on gathering more debugging info would be greatly appreciated!
Which operating system are you running Bazel on?
linux
What is the output of
bazel info release
?release 6.3.2
If
bazel info release
returnsdevelopment version
or(@non-git)
, tell us how you built Bazel.No response
What's the output of
git remote get-url origin; git rev-parse master; git rev-parse HEAD
?Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No; no ideas from those on Slack yet either
Any other information, logs, or outputs that you want to share?
No response