buchgr / bazel-remote

A remote cache for Bazel
https://bazel.build
Apache License 2.0

Build fails on Bazel 7.0 when remote_download_toplevel flag is enabled #730

Open sanju-naik opened 5 months ago

sanju-naik commented 5 months ago

After upgrading to Bazel 7.0.0 and enabling the remote_download_toplevel flag, our builds fail intermittently while downloading cached artifacts from the remote cache.

The two errors we get are:

Exec failed due to IOException: Connection reset
Exec failed due to IOException: null

There are no other details in the log. Other things we noticed are:

mostynb commented 5 months ago

Are there any relevant errors or warnings in the bazel-remote log when this occurs?

sanju-naik commented 5 months ago

Today, when one of our jobs failed, I found this error in the job log. Does it help in any way to debug this issue?

---8<---8<--- Exception details ---8<---8<---
java.io.IOException: Failed to read @-argument 'bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params' from file '/private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params'.
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:315)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner.createWorkRequest(WorkerSpawnRunner.java:246)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner.execInWorker(WorkerSpawnRunner.java:416)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner.exec(WorkerSpawnRunner.java:206)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
    at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
    at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:261)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1148)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1065)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:562)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:859)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:333)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:171)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
    at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.io.FileNotFoundException: /private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params (No such file or directory)
    at java.base/java.io.FileInputStream.open0(Native Method)
    at java.base/java.io.FileInputStream.open(Unknown Source)
    at java.base/java.io.FileInputStream.<init>(Unknown Source)
    at com.google.devtools.build.lib.unix.UnixFileSystem.createFileInputStream(UnixFileSystem.java:497)
    at com.google.devtools.build.lib.vfs.AbstractFileSystem.createMaybeProfiledInputStream(AbstractFileSystem.java:90)
    at com.google.devtools.build.lib.vfs.AbstractFileSystem.getInputStream(AbstractFileSystem.java:59)
    at com.google.devtools.build.lib.vfs.Path.getInputStream(Path.java:765)
    at com.google.devtools.build.lib.vfs.FileSystemUtils$1.openStream(FileSystemUtils.java:354)
    at com.google.common.io.ByteSource$AsCharSource.openStream(ByteSource.java:474)
    at com.google.common.io.CharSource.openBufferedStream(CharSource.java:126)
    at com.google.common.io.CharSource.readLines(CharSource.java:336)
    at com.google.devtools.build.lib.vfs.FileSystemUtils.readLines(FileSystemUtils.java:834)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:310)
    ... 23 more
---8<---8<--- End of exception details ---8<---8<---

mostynb commented 5 months ago

I don't know Bazel internals, but this stack trace suggests the failure occurs while executing the action on the client side. Have you tried reporting this error to the Bazel project?

mostynb commented 5 months ago

Also, I think the bazel-remote logs would be important to check here: are there any warnings or errors there?

sanju-naik commented 5 months ago

> Also, I think the bazel-remote logs would be important to check here: are there any warnings or errors there?

We are seeing these failures on our scheduled pipelines, and most of these jobs fail during the night. The next day I have a hard time collecting logs from bazel-remote: it logs every event to the log file, so by the time I check there are too many entries and I can't pick out the ones specific to these jobs.

Is there a quick way to get logs associated with a particular job?

sanju-naik commented 5 months ago

Also, we are still on version 2.3.9. Have any fixes related to Bazel 7 been added in the latest releases?

mostynb commented 5 months ago

> > Also, I think the bazel-remote logs would be important to check here: are there any warnings or errors there?
>
> We are seeing these failures on our scheduled pipelines, and most of these jobs fail during the night. The next day I have a hard time collecting logs from bazel-remote: it logs every event to the log file, so by the time I check there are too many entries and I can't pick out the ones specific to these jobs.
>
> Is there a quick way to get logs associated with a particular job?

I think it depends a bit on the logging options that you are using. If you have timestamps enabled, you can jump to a time just before the error and scan from there. Alternatively, if you have access logs enabled, you might be able to search for a blob or ActionResult hash from the error (if you have something like that in the Bazel logs). Or maybe you could just grep the bazel-remote logs for "error" or "warning" (ignoring case) and see if there's anything interesting.
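
As a rough illustration of the search described above, the following Python sketch keeps only the log lines that fall inside a time window and either mention "error" or "warning" (ignoring case) or contain a given hash. The timestamp format, the function name filter_log, and the log file name in the usage comment are assumptions made for illustration, not confirmed bazel-remote behaviour; adjust the pattern to the logging options you actually use.

```python
import re
from datetime import datetime

# Assumption: each log line starts with a "YYYY/MM/DD HH:MM:SS" timestamp.
# Change TS_PATTERN and TS_FORMAT if your log lines look different.
TS_PATTERN = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")
TS_FORMAT = "%Y/%m/%d %H:%M:%S"
KEYWORDS = re.compile(r"error|warning", re.IGNORECASE)


def filter_log(lines, start, end, needle=None):
    """Return lines timestamped within [start, end] that mention an error or
    warning, or that contain `needle` (for example a blob hash) if given."""
    matches = []
    for line in lines:
        m = TS_PATTERN.match(line)
        if not m:
            continue  # skip lines without a recognizable timestamp
        ts = datetime.strptime(m.group(1), TS_FORMAT)
        if not (start <= ts <= end):
            continue  # outside the window around the failed job
        if KEYWORDS.search(line) or (needle and needle in line):
            matches.append(line.rstrip("\n"))
    return matches


# Hypothetical usage: scan a window around a nightly job failure.
# with open("bazel-remote.log") as f:
#     hits = filter_log(f, datetime(2024, 1, 15, 2, 0), datetime(2024, 1, 15, 3, 0))
```

Passing the open file object streams the log line by line, so a large log does not have to fit in memory. The needle argument only helps if the Bazel error output contains a hash to search for.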

> Also, we are still on version 2.3.9. Have any fixes related to Bazel 7 been added in the latest releases?

The releases page has a high-level changelog: https://github.com/buchgr/bazel-remote/releases - but I don't think there are any changes specifically related to Bazel 7.

liam-baker-sm commented 4 months ago

We currently run many Bazel 7.0.0 builds with remote_download_toplevel each day against a bazel-remote cache without problems. "IOException: Connection reset" suggests the connection was dropped. Do you use HTTP(S) or gRPC(S) for the cache URL in Bazel? Is there a proxy between your Bazel clients and the bazel-remote server (even on the same machine)?