Dynamic execution fails non-deterministically with `InterruptedException` on local work branch

Description of the bug:

With --spawn_strategy=dynamic, Bazel occasionally and non-deterministically fails with an unhandled internal exception (exit code 37). The failure rate is around 2-3% of invocations, and an immediate retry never reproduces the same failure.

The internal exception appears to only come from the `DynamicSpawnStrategy` class

After some initial investigation I temporarily added the --debug_spawn_scheduler flag to get more details. With that flag enabled I was able to consistently get details like this on these failures:

INFO: Caught InterruptedException from ExecutionException for local branch of {{Action}}, which may cause a crash.
INFO: CancellationException of remote branch of {{Action}}, returning null
...
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of {{Action}} completed. Local was not cancelled and done and remote was cancelled and done.
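
To pick these failures out of a large volume of debug output, the two log lines above can be matched as a pair. The following is a minimal sketch, assuming the message wording is exactly as quoted here; the function name `find_crash_signature` is hypothetical and not part of Bazel.

```python
import re

# Patterns copied from the --debug_spawn_scheduler output quoted above.
# The exact wording is an assumption and may differ between Bazel versions.
LOCAL_INTERRUPTED = re.compile(
    r"Caught InterruptedException from ExecutionException "
    r"for local branch of (?P<action>.+?), which may cause a crash"
)
NEITHER_COMPLETED = re.compile(
    r"AssertionError: Neither branch of (?P<action>.+?) completed\."
)


def find_crash_signature(log_lines):
    """Return actions whose local-branch interrupt was later followed by the crash."""
    interrupted = set()
    crashed = []
    for line in log_lines:
        match = LOCAL_INTERRUPTED.search(line)
        if match:
            interrupted.add(match.group("action"))
            continue
        match = NEITHER_COMPLETED.search(line)
        if match and match.group("action") in interrupted:
            crashed.append(match.group("action"))
    return crashed
```

Feeding it the lines of a captured build log returns the action descriptions that hit this specific crash, which makes it easier to tally which mnemonics are affected.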

The issue does not appear related to the `--dynamic_local_strategy`

The {{Action}} referenced in the error is different each time. Sometimes it is a KotlinCompile action, which uses --dynamic_local_strategy=KotlinCompile=worker, and other times it is a JavaIjar action, which uses --dynamic_local_strategy=JavaIjar=local. So I don't think the issue is specific to the local strategy used. We have not observed any actions using the sandboxed strategy failing in this way, but that may just be because so few actions run with the sandboxed strategy in our builds. We have also seen rare cases of this failure on 'external' rules like external/io_bazel_rules_scala_scala_library_2_13_12/io_bazel_rules_scala_scala_library_2_13_12.stamp/scala-library-2.13.12-stamped.jar.

The issue does not appear to be related to machine resource utilization

Before the investigation began, my initial assumption was that this issue was caused by the OOM killer or some other resource limitation on the machine. However, after digging through our analytics for these builds, that seems extremely unlikely. We have seen resource-related issues in the past on smaller machines, but those failed in a very different way, and the actual utilization on these machines is relatively low. In the couple dozen cases that I've spot-checked, CPU, RAM, disk, and network utilization were all well below 50%. Even narrowing the time window to the minute or two before the failure doesn't show any spike in utilization or any hint of throttling.

The issue DOES appear related to the remote cache

One commonality I saw when going through the detailed profiles for these failures was that the action that failed should not have been affected by the changes in that commit. In other words, we would have expected that action to already be in the remote cache. It's worth mentioning that these builds run as part of our CI pipeline, and that machines are assigned arbitrarily. It is therefore very normal for a single machine to run builds on different commits, jumping back and forth in time as old and new commits are tested. So part of me suspects that this issue only really reproduces when the output base is mostly invalidated by jumping between very different commits. To potentially mitigate this, I tried enabling the disk cache. At first I thought this mitigation helped, but over time we still saw this failure scenario. That said, because these CI machines are scaled based on demand, it's possible we just aren't able to keep the disk cache warm enough for the mitigation to be effective.

Which category does this issue belong to?

Core, Local Execution, Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I have not been able to reproduce this issue in any artificial scenario:

- I have tried some suggestions like --experimental_local_execution_delay=0 & --jobs=32 to try to force more actions to actively race on the dynamic strategy.
- I have tried entirely deleting the output base in between runs.
- I have tried cherry-picking specific invocations we observed this failure on and running them verbatim.
- I have tried with and without the disk cache.
- I have tried all combinations and permutations of the above.
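
For anyone who wants to attempt the same kind of stress run, the attempts above can be sketched as a loop like the one below. This is an illustration under stated assumptions, not the exact script we used: the flags and exit code 37 come from this report, while the helper names (`run_bazel`, `stress`) and the target pattern are hypothetical.

```python
import subprocess

CRASH_EXIT_CODE = 37  # exit code this report associates with the internal error

# Flags taken from this report; everything else about the build is assumed.
STRESS_FLAGS = [
    "--spawn_strategy=dynamic",
    "--experimental_local_execution_delay=0",
    "--jobs=32",
    "--debug_spawn_scheduler",
]


def run_bazel(args):
    """Run one Bazel command and return its exit code."""
    return subprocess.run(["bazel", *args]).returncode


def stress(targets, attempts, runner=run_bazel, clean_between=True):
    """Build repeatedly; return the 1-based attempt that crashed, or None."""
    for attempt in range(1, attempts + 1):
        if clean_between:
            # `bazel clean --expunge` removes the entire output base.
            runner(["clean", "--expunge"])
        code = runner(["build", *STRESS_FLAGS, *targets])
        if code == CRASH_EXIT_CODE:
            return attempt
    return None
```

Calling `stress(["//some/package/..."], attempts=50)` (the target pattern is a placeholder) would stop at the first crashing build. In our case, loops of this kind never hit the failure outside CI.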

If I cannot reproduce the issue, how confident am I about the details here?

I'm very confident that this issue is real. Despite not being able to reproduce this behavior artificially, and despite the relatively low failure rate, this issue is very detectable in my systems. As mentioned above, it is observed in the CI pipeline at my place of work, so even with the low failure rate I still have significant data to back it up. To provide some very rough details:

- I had dynamic execution enabled on 'small builds' in our pipeline over a 3-week period.
  - The threshold for a small build is something I configure.
  - Changing this threshold does not substantially impact the failure rate of this issue.
  - We have observed this issue on builds ranging from 50 to 10K targets.
  - I'm not sure if it's statistically significant, but if anything the failure rate was slightly higher on very small builds of <500 targets (maybe 4%, up from 2-3%).
- Over this time period we ran over 20k of these dynamic builds and saw over 500 of these failures.
- Because we can only reproduce this in our CI pipeline, only small, incremental, and safe changes to the build are possible to help diagnose the issue.
  - We could only enable --debug_spawn_scheduler temporarily due to the volume of log spew it produces.
  - We cannot bisect the Bazel version or try potentially unstable patch fixes.
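
As a rough sanity check on the counts above, a confidence interval for the overall failure rate can be computed as follows. This is a sketch that uses the Wilson score interval (my choice of method, not something from our tooling) and the rounded figures of 500 failures in 20,000 builds. The real counts were somewhat higher on both sides.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Rounded figures from this report: "over 500" failures in "over 20k" builds.
low, high = wilson_interval(500, 20_000)
print(f"overall rate {500 / 20_000:.1%}, approx. 95% interval {low:.2%} to {high:.2%}")
```

Running the same function on the failure and build counts for the <500 target subset (not included in this report) would show whether its interval overlaps the overall one, which is one way to judge whether the roughly 4% figure is a real difference.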

Which operating system are you running Bazel on?

Linux x86_64

What is the output of `bazel info release`?

7.4.0

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response

What's the output of `git remote get-url origin; git rev-parse HEAD` ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

Unclear if this is a regression. Cannot bisect due to inability to reproduce the issue consistently. Issue verified on 7.0.2 & 7.4.0.

Have you found anything relevant by searching the web?

- https://github.com/bazelbuild/bazel/issues/20888
  - References the same InterruptedException we see here.
  - Discusses lots of possible causes, most of which revolve around local persistent workers.
  - We tried using --experimental_dynamic_ignore_local_signals=8,9,10 as outlined in this thread. It did not appear to have much impact.
- https://github.com/bazelbuild/bazel/issues/22482
  - Possibly related. There is not enough context on this issue to be sure, though.
- https://github.com/bazelbuild/bazel/issues/21773#issuecomment-2388716250
  - Mentions the InterruptedException and implies that an error performing downloads could be the root cause. This aligns with some of the other observations we have made.
  - The specific issue there was fixed in 7.4.0, though, and upgrading to that version has not fixed the failure we see.

Any other information, logs, or outputs that you want to share?

An abridged list of the flags enabled on these builds:

--noenable_bzlmod
--nolegacy_important_outputs
--watchfs
--experimental_repository_cache_hardlinks
--incompatible_allow_tags_propagation
--experimental_guard_against_concurrent_changes
--noslim_profile
--noexperimental_merged_skyframe_analysis_execution
--java_language_version=17
--java_runtime_version=remotejdk_17
--tool_java_language_version=17
--tool_java_runtime_version=remotejdk_17
--javacopt="-XepDisableAllChecks --release 8"
--host_javacopt=-XepDisableAllChecks
--experimental_java_classpath=off
--nojava_header_compilation
--noincompatible_java_common_parameters
--sandbox_block_path=/usr/local
--sandbox_block_path=/opt
--verbose_failures
--experimental_profile_include_target_label
--experimental_collect_system_network_usage
--host_cxxopt=-std=c++14
--experimental_repository_cache_urls_as_default_canonical_id
--incompatible_enable_proto_toolchain_resolution
--local_cpu_resources=HOST_CPUS
--local_ram_resources=HOST_RAM
--remote_upload_local_results
--remote_timeout=3600
--jobs=1536
--define=EXECUTOR=remote
--remote_default_exec_properties=OSFamily=Linux
--grpc_keepalive_time=30s
--noexperimental_throttle_action_cache_check
--action_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1
--incompatible_strict_action_env
--incompatible_enable_cc_toolchain_resolution
--extra_execution_platforms=@aspect_gcc_toolchain//platforms:x86_64_linux_remote
--host_platform=@aspect_gcc_toolchain//platforms:x86_64_linux_remote
--symlink_prefix=dist/
--crosstool_top=@gcc_toolchain_x86_64//:_cc_toolchain
--host_cpu=k8
--cpu=k8
--experimental_remote_mark_tool_inputs
--heap_dump_on_oom
--internal_spawn_scheduler
--experimental_worker_cancellation
--noworker_multiplex
--worker_quit_after_build
--dynamic_local_strategy=sandboxed,local,worker
--experimental_dynamic_local_load_factor=2
--local_termination_grace_seconds=60
--remote_download_outputs=all
--remote_local_fallback
--spawn_strategy=dynamic
--dynamic_local_strategy=Javac=worker
--dynamic_local_strategy=JavaDeployJar=local
--dynamic_local_strategy=JavaIjar=local
--dynamic_local_strategy=JavaSourceJar=local
--dynamic_local_strategy=KotlinCompile=worker
--dynamic_local_strategy=Turbine=local
--disk_cache=~/.cache/_bazel_disk_cache
--experimental_disk_cache_gc_max_size=500G
--experimental_disk_cache_gc_max_age=0
--experimental_disk_cache_gc_idle_delay=1m
--nobuild_runfile_links
--strategy=TestRunner=remote
--strategy=Genrule=remote
--experimental_dynamic_ignore_local_signals=8,9,10
--bes_upload_mode=FULLY_ASYNC
