bazelbuild / bazel-buildfarm

Bazel remote caching and execution service
https://bazel.build
Apache License 2.0

[bug] NOT_FOUND response on executeRemotely #1303

Open luxe opened 1 year ago

luxe commented 1 year ago

The client sees the following stack trace:

(22:13:53) ERROR: /code/some/place/BUILD:3179:7: Testing //some/target failed: (Exit 34): NOT_FOUND: null
java.io.IOException: io.grpc.StatusRuntimeException: NOT_FOUND
    at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.executeRemotely(GrpcRemoteExecutor.java:235)
    at com.google.devtools.build.lib.remote.RemoteExecutionService.executeRemotely(RemoteExecutionService.java:1258)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:268)
    at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:244)
    at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:125)
    at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:114)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:243)
    at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:245)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:146)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:108)
    at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
    at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:68)
    at com.google.devtools.build.lib.exec.StandaloneTestStrategy.beginTestAttempt(StandaloneTestStrategy.java:440)
    at com.google.devtools.build.lib.exec.StandaloneTestStrategy.access$200(StandaloneTestStrategy.java:84)
    at com.google.devtools.build.lib.exec.StandaloneTestStrategy$StandaloneTestRunnerSpawn.beginExecution(StandaloneTestStrategy.java:672)
    at com.google.devtools.build.lib.analysis.test.TestRunnerAction.beginIfNotCancelled(TestRunnerAction.java:921)
    at com.google.devtools.build.lib.analysis.test.TestRunnerAction.beginExecution(TestRunnerAction.java:888)
    at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:946)
    at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:937)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:907)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1076)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1031)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:152)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:91)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:492)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:856)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:169)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:590)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: NOT_FOUND
    at io.grpc.Status.asRuntimeException(Status.java:535)
    at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:648)
    at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.lambda$executeRemotely$2(GrpcRemoteExecutor.java:169)
    at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:244)
    at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:125)
    at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:114)
    at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.lambda$executeRemotely$3(GrpcRemoteExecutor.java:140)
    at com.google.devtools.build.lib.remote.util.Utils.refreshIfUnauthenticated(Utils.java:525)
    at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.executeRemotely(GrpcRemoteExecutor.java:138)
    ... 32 more

We should provide an explanation for the NOT_FOUND response to the client. It appears that the watcher may not have found the operation: https://github.com/bazelbuild/bazel-buildfarm/blob/74f7799d08c0b6bd02a8730d873338f461a2c6be/src/main/java/build/buildfarm/server/services/ExecutionService.java#L124-L126 However, I don't see how that code path could be reached.

luxe commented 1 year ago

It's likely from here, because the operation somehow expired from Redis: https://github.com/bazelbuild/bazel-buildfarm/blob/74f7799d08c0b6bd02a8730d873338f461a2c6be/src/main/java/build/buildfarm/instance/shard/ShardInstance.java#L2502-L2505
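
To illustrate the two points above (an operation that has expired from the backing store, and a NOT_FOUND that carries no explanation), here is a minimal sketch. It is not buildfarm's code: `OperationLookupSketch`, its `Status` class, and the in-memory expiry map are hypothetical stand-ins for the Redis-backed operation store and for `io.grpc.Status`. It only shows how a lookup could attach a description instead of returning a bare NOT_FOUND.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an operation store with expiring entries whose
// lookup explains *why* an operation was not found.
final class OperationLookupSketch {
  // Stand-in for io.grpc.Status (assumption: real code would use grpc's type).
  static final class Status {
    final String code;
    final String description;

    Status(String code, String description) {
      this.code = code;
      this.description = description;
    }
  }

  // operation name -> expiry time in epoch milliseconds
  private final Map<String, Long> expiresAtMillis = new ConcurrentHashMap<>();

  void put(String operationName, long expiresAt) {
    expiresAtMillis.put(operationName, expiresAt);
  }

  // Returns OK if the operation is still present, otherwise NOT_FOUND
  // with a description the client can surface to the user.
  Status watch(String operationName, long nowMillis) {
    Long expiresAt = expiresAtMillis.get(operationName);
    if (expiresAt == null) {
      return new Status(
          "NOT_FOUND", "operation '" + operationName + "' is not known to this instance");
    }
    if (expiresAt <= nowMillis) {
      expiresAtMillis.remove(operationName);
      return new Status(
          "NOT_FOUND",
          "operation '" + operationName + "' expired before it could be watched;"
              + " the action should be re-executed");
    }
    return new Status("OK", "");
  }

  public static void main(String[] args) {
    OperationLookupSketch store = new OperationLookupSketch();
    store.put("op-1", 1_000L);
    Status late = store.watch("op-1", 2_000L);
    System.out.println(late.code + ": " + late.description);
  }
}
```

With a description attached, the client-side error would read something other than `NOT_FOUND: null`, which is the gap the issue asks to close.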

werkt commented 1 year ago

That NOT_FOUND was supposed to inspire the client to retry the execution (in the virtuous loop in RemoteSpawnStrategy, outside of executeRemotely). This inspiration has fallen flat, as Bazel obviously sees this as a build-breaking error.

The relevant retrier logic is here; it currently says 'shouldRetry' only for an all-NOT_FOUND BulkTransferException and for a failed precondition status that 'looks' like a retriable error. We can try to shim this response in as a FailedPrecondition to fake out the Bazel client, or we can get clarity on the REAPI itself, which is currently mum on the possible status returns for waitExecution in the case of an unknown operation id. That clarity will be required to get the remote code fixed in Bazel.
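
The two options can be sketched side by side. This is a simplified illustration, not Bazel's retrier or buildfarm's service code: `RetryOnNotFoundSketch`, its `Code` enum, `shimForClient`, and the `retryNotFound` flag are all hypothetical, and the real client check on a failed precondition inspects the status details rather than the code alone.

```java
// Hypothetical sketch of the two options discussed: a server-side shim
// that rewrites NOT_FOUND, or a client retry predicate that accepts it.
final class RetryOnNotFoundSketch {
  // Stand-in for the gRPC status codes relevant here.
  enum Code { OK, NOT_FOUND, FAILED_PRECONDITION, INTERNAL }

  // One execution attempt; the attempt number starts at 1.
  interface Attempt {
    Code run(int attemptNumber);
  }

  // Option 1 (server side): present an unknown operation as
  // FAILED_PRECONDITION so that an unmodified client retries.
  static Code shimForClient(Code serverCode) {
    return serverCode == Code.NOT_FOUND ? Code.FAILED_PRECONDITION : serverCode;
  }

  // Option 2 (client side): a predicate that can be told to treat
  // NOT_FOUND as retriable. Assumption: every FAILED_PRECONDITION is
  // retriable here, which is looser than the real check.
  static boolean shouldRetry(Code code, boolean retryNotFound) {
    if (code == Code.FAILED_PRECONDITION) {
      return true;
    }
    return retryNotFound && code == Code.NOT_FOUND;
  }

  // Re-runs the attempt until it succeeds, hits a non-retriable code,
  // or exhausts maxAttempts; returns the last code observed.
  static Code executeWithRetry(Attempt attempt, int maxAttempts, boolean retryNotFound) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Code last = Code.INTERNAL;
    for (int i = 1; i <= maxAttempts; i++) {
      last = attempt.run(i);
      if (last == Code.OK || !shouldRetry(last, retryNotFound)) {
        return last;
      }
    }
    return last;
  }

  public static void main(String[] args) {
    // First attempt loses its operation, second attempt succeeds.
    Attempt flaky = n -> n == 1 ? Code.NOT_FOUND : Code.OK;
    System.out.println(executeWithRetry(flaky, 3, false));
    System.out.println(executeWithRetry(n -> shimForClient(flaky.run(n)), 3, false));
  }
}
```

The shim keeps clients unchanged at the cost of misreporting the status; changing the predicate is cleaner but depends on the REAPI stating what waitExecution should return for an unknown operation id.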