Does your build already work with Bazel 8? If so, it would be good to verify whether the issue still exists with the current RC.
From bug triage meeting: let's add some more logging so we can better understand what's happening.
Description of the bug:
When using `--spawn_strategy=dynamic`, Bazel occasionally fails with an internal unhandled exception (exit code 37) in a non-deterministic way. The failure rate is around 2-3% of invocations, and retrying immediately never reproduces the same failure. The internal exception appears to only come from the `DynamicSpawnStrategy` class.
After some initial investigation I temporarily added the `--debug_spawn_scheduler` flag to get more details. With that flag enabled I was able to consistently capture additional scheduler details on these failures.
The issue does not appear related to the `--dynamic_local_strategy`
The `Action` referenced in the error is always different. Sometimes it comes from a `KotlinCompile` action, which uses `--dynamic_local_strategy=KotlinCompile=worker`, and other times it comes from `JavaIjar` actions, which use `--dynamic_local_strategy=JavaIjar=local`. So I don't think the issue is specific to the local strategy used. That said, we have not observed any actions using the `sandboxed` strategy failing in this way, though that may just be because so few actions in our builds run with the `sandboxed` strategy. We have also seen rare cases of this failure happening on 'external' rules like `external/io_bazel_rules_scala_scala_library_2_13_12/io_bazel_rules_scala_scala_library_2_13_12.stamp/scala-library-2.13.12-stamped.jar`.
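As a rough sketch of the flag combination involved (the target pattern and mnemonic overrides are illustrative, and remote cache/execution flags are omitted):
```sh
# Illustrative sketch of the dynamic execution flags described above.
# Remote cache/execution flags are omitted; mnemonic overrides are examples.
bazel build //... \
  --spawn_strategy=dynamic \
  --dynamic_local_strategy=KotlinCompile=worker \
  --dynamic_local_strategy=JavaIjar=local \
  --debug_spawn_scheduler   # only enabled temporarily while investigating
```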
The issue does not appear to be related to machine resource utilization
Before the investigation began, my initial assumption was that this issue was caused by the OOM killer or some other resource limitation on the machine. However, after digging through our analytics for these builds, that seems extremely unlikely. We've seen such issues in the past on smaller machines, but aside from failing in a very different way, the actual utilization on these machines is relatively low. In the couple dozen cases that I've spot-checked, CPU, RAM, disk, and network utilization were all well below 50%. Even zooming the time window in to the minute or two before the failure doesn't show any spike in utilization or any hint of throttling.
The issue DOES appear related to the remote cache
One commonality I saw when going through the detailed profiles for these failures was that the action that failed shouldn't have been affected by the changes present on that commit; that is, we would have expected that action to already be cached remotely. At this point it's worth mentioning that these builds run as part of our CI pipeline and that machines are assigned arbitrarily, so it is very normal for a single machine to run builds on different commits, jumping back and forth in time as old and new commits are tested. So part of me suspects that this issue only really reproduces when the output base is mostly invalidated due to jumping between very different commits. To potentially mitigate this I tried enabling the disk cache. At first I thought this mitigation helped, but over time we still saw this failure scenario. However, because these CI machines are scaled up and down based on demand, it's possible we just aren't able to keep the disk cache warm enough for it to be an effective mitigation.
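For reference, the disk cache mitigation was along these lines (the endpoint and path here are placeholders, not our actual configuration):
```sh
# Attempted mitigation: keep a local disk cache alongside the remote cache so
# cache hits survive when CI machines jump between very different commits.
bazel build //... \
  --remote_cache=grpcs://remote-cache.example.com \
  --disk_cache=/var/cache/bazel/disk-cache
```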
Which category does this issue belong to?
Core, Local Execution, Remote Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I have not been able to reproduce this issue in any artificial scenario. Among other things, I tried setting `--experimental_local_execution_delay=0` and `--jobs=32` to try to force more actions to actively race on the dynamic strategy.
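For illustration, those attempts looked roughly like this (the target pattern is a placeholder):
```sh
# Attempted (unsuccessful) artificial repro: drop the local execution delay and
# raise parallelism so more actions race under the dynamic strategy.
bazel build //... \
  --spawn_strategy=dynamic \
  --experimental_local_execution_delay=0 \
  --jobs=32
```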
If I cannot reproduce the issue, how confident am I about the details here?
I'm very confident that this issue is real. Despite not being able to artificially reproduce this behavior, and despite claiming a relatively low failure rate, this issue is very detectable in my systems. As mentioned above, this issue is observed in the CI pipeline at my place of work, so even with the low failure rate I still have significant data to back it up. To provide some very rough details: we have only enabled `--debug_spawn_scheduler` temporarily, due to the volume of log spew it produces.
Which operating system are you running Bazel on?
Linux x86_64
What is the output of `bazel info release`?
7.4.0
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse HEAD`?
No response
If this is a regression, please try to identify the Bazel commit where the bug was introduced with `bazelisk --bisect`.
Unclear if this is a regression. Cannot bisect due to the inability to reproduce the issue consistently. Issue verified on 7.0.2 & 7.4.0.
Have you found anything relevant by searching the web?
We found an existing issue that discusses a similar `InterruptedException` to the one we see here. We tried setting `--experimental_dynamic_ignore_local_signals=8,9,10` as outlined in that thread; it did not appear to have much impact. Another issue discusses a similar `InterruptedException` and implies that an error performing downloads could be the root cause, which aligns with some of the other observations we have made. We are already on 7.4.0 though, and we have not seen this issue fixed with that upgrade.
Any other information, logs, or outputs that you want to share?
An abridged list of flags enabled on these builds
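To summarize, here is a sketch consolidating the flags already mentioned in this report; the target pattern, cache endpoint, and path are placeholders, and this is not our exact or complete configuration:
```sh
# Consolidated, illustrative view of the flags discussed in this report.
bazel build //... \
  --spawn_strategy=dynamic \
  --dynamic_local_strategy=KotlinCompile=worker \
  --dynamic_local_strategy=JavaIjar=local \
  --remote_cache=grpcs://remote-cache.example.com \
  --disk_cache=/var/cache/bazel/disk-cache \
  --experimental_dynamic_ignore_local_signals=8,9,10 \
  --debug_spawn_scheduler   # only enabled temporarily while investigating
```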