bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Dynamic execution fails non-deterministically with `InterruptedException` on local work branch #24230

Open freetheinterns opened 1 week ago

freetheinterns commented 1 week ago

Description of the bug:

When using --spawn_strategy=dynamic, Bazel occasionally crashes with an unhandled internal exception (exit code 37). The failure is non-deterministic: it affects around 2-3% of invocations, and an immediate retry never reproduces the same failure.

The internal exception appears to only come from the DynamicSpawnStrategy class

After some initial investigation, I temporarily added the --debug_spawn_scheduler flag to get more details. With that flag enabled, these failures consistently produce output like this:

INFO: Caught InterruptedException from ExecutionException for local branch of {{Action}}, which may cause a crash.
INFO: CancellationException of remote branch of {{Action}}, returning null
...
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of {{Action}} completed. Local was not cancelled and done and remote was cancelled and done.
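
To make the state named in the assertion message easier to reason about, here is a toy model in Python. It is not Bazel's Java implementation, and `pick_result`, `describe`, and `simulate_crash_state` are hypothetical names chosen for illustration. It only assumes what the log above says: the local branch finished with an exception (so it is done but not cancelled), while the remote branch was cancelled.

```python
from concurrent.futures import Future


def describe(name, fut):
    """Render a branch state in the same shape as the assertion message."""
    cancelled = "cancelled" if fut.cancelled() else "not cancelled"
    done = "done" if fut.done() else "not done"
    return f"{name} was {cancelled} and {done}"


def pick_result(local, remote):
    """Return the result of whichever branch completed successfully.

    Raises AssertionError when neither branch produced a usable result.
    """
    for fut in (local, remote):
        # A cancelled future has no exception() to inspect, so check that first.
        if fut.done() and not fut.cancelled() and fut.exception() is None:
            return fut.result()
    raise AssertionError(
        "Neither branch completed. "
        f"{describe('Local', local)} and {describe('remote', remote)}."
    )


def simulate_crash_state():
    """Build the two-branch state described by the log lines (assumed, not traced)."""
    local, remote = Future(), Future()
    remote.cancel()  # remote branch: cancelled, which also counts as done
    local.set_exception(InterruptedError("local branch interrupted"))  # done, not cancelled
    return local, remote
```

Calling `pick_result(*simulate_crash_state())` raises an `AssertionError` because no branch is left to supply a result, which is the shape of the crash reported above. What actually interrupts the local branch in Bazel is the open question of this issue and is not modeled here.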

The issue does not appear related to the --dynamic_local_strategy

The {{Action}} referenced in the error is different each time. Sometimes it is a KotlinCompile action, which uses --dynamic_local_strategy=KotlinCompile=worker; other times it is a JavaIjar action, which uses --dynamic_local_strategy=JavaIjar=local. So I don't think the issue is specific to the local strategy used. We have not observed any actions using the sandboxed strategy failing in this way, but that may just be because so few actions run with the sandboxed strategy in our builds. We have also seen rare cases of this failure on 'external' rules like external/io_bazel_rules_scala_scala_library_2_13_12/io_bazel_rules_scala_scala_library_2_13_12.stamp/scala-library-2.13.12-stamped.jar.
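
For anyone who wants to check the same spread of actions in their own CI logs, the following is a minimal sketch that tallies crashes per action. It assumes the logs contain the AssertionError line exactly as quoted above, with real action text in place of {{Action}}; the function names and the sample action strings in the usage note are hypothetical.

```python
import re
from collections import Counter

# Matches the crash line quoted in this report; the action text is whatever
# sits between "Neither branch of " and " completed.".
CRASH = re.compile(
    r"java\.lang\.AssertionError: Neither branch of (?P<action>.+?) completed\."
)


def crashed_actions(log_text):
    """Return the action text of every dynamic-execution crash in a log."""
    return [m.group("action") for m in CRASH.finditer(log_text)]


def tally(log_texts):
    """Count crashes per action across many build logs."""
    counts = Counter()
    for text in log_texts:
        counts.update(crashed_actions(text))
    return counts
```

Usage would look like `tally(open(p).read() for p in log_paths)`, where `log_paths` is a list of captured Bazel console logs. Grouping by mnemonic rather than by full action text would need knowledge of the real action description format, which this report does not show.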

The issue does not appear to be related to machine resource utilization

Before the investigation began, my initial assumption was that this issue was caused by the OOM killer or some other resource limitation on the machine. However, after digging through our analytics for these builds, that seems extremely unlikely. We have seen resource-related failures in the past on smaller machines, but those failed in a very different way, and the actual utilization on the machines affected here is relatively low. In the couple dozen cases I spot-checked, CPU, RAM, disk, and network utilization were all well below 50%. Even narrowing the time window to the minute or two before the failure shows no spike in utilization and no hint of throttling.

The issue DOES appear related to the remote cache

One commonality in the detailed profiles for these failures is that the failing action should not have been affected by the changes on that commit. In other words, we would have expected that action to already be in the remote cache. These builds run as part of our CI pipeline, and machines are assigned arbitrarily, so it is normal for a single machine to build different commits, jumping back and forth in time as old and new commits are tested. Part of me therefore suspects that this issue only reproduces when the output base is mostly invalidated by jumping between very different commits.

To potentially mitigate this, I tried enabling the disk cache. At first I thought it helped, but over time we still saw this failure. Because these CI machines are scaled based on demand, it's possible we just can't keep the disk cache warm enough for the mitigation to be effective.

Which category does this issue belong to?

Core, Local Execution, Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I have not been able to reproduce this issue in any artificial scenario:

If I cannot reproduce the issue, how confident am I about the details here?

I'm very confident that this issue is real. Despite not being able to artificially reproduce this behavior, and claiming a relatively low failure rate, this issue is very detectable in my systems. As mentioned above this issue is observed in the CI pipeline at my place of work, so even with the low failure rate I still have significant data to back up this issue. To provide some very rough details:

Which operating system are you running Bazel on?

Linux x86_64

What is the output of bazel info release?

7.4.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

Unclear if this is a regression. Cannot bisect due to inability to reproduce the issue consistently. Issue verified on 7.0.2 & 7.4.0.

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

An abridged list of flags enabled on these builds

--noenable_bzlmod
--nolegacy_important_outputs
--watchfs
--experimental_repository_cache_hardlinks
--incompatible_allow_tags_propagation
--experimental_guard_against_concurrent_changes
--noslim_profile
--noexperimental_merged_skyframe_analysis_execution
--java_language_version=17
--java_runtime_version=remotejdk_17
--tool_java_language_version=17
--tool_java_runtime_version=remotejdk_17
--javacopt="-XepDisableAllChecks --release 8"
--host_javacopt=-XepDisableAllChecks
--experimental_java_classpath=off
--nojava_header_compilation
--noincompatible_java_common_parameters
--sandbox_block_path=/usr/local
--sandbox_block_path=/opt
--verbose_failures
--experimental_profile_include_target_label
--experimental_collect_system_network_usage
--host_cxxopt=-std=c++14
--experimental_repository_cache_urls_as_default_canonical_id
--incompatible_enable_proto_toolchain_resolution
--local_cpu_resources=HOST_CPUS
--local_ram_resources=HOST_RAM
--remote_upload_local_results
--remote_timeout=3600
--jobs=1536
--define=EXECUTOR=remote
--remote_default_exec_properties=OSFamily=Linux
--grpc_keepalive_time=30s
--noexperimental_throttle_action_cache_check
--action_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1
--incompatible_strict_action_env
--incompatible_enable_cc_toolchain_resolution
--extra_execution_platforms=@aspect_gcc_toolchain//platforms:x86_64_linux_remote
--host_platform=@aspect_gcc_toolchain//platforms:x86_64_linux_remote
--symlink_prefix=dist/
--crosstool_top=@gcc_toolchain_x86_64//:_cc_toolchain
--host_cpu=k8
--cpu=k8
--experimental_remote_mark_tool_inputs
--heap_dump_on_oom
--internal_spawn_scheduler
--experimental_worker_cancellation
--noworker_multiplex
--worker_quit_after_build
--dynamic_local_strategy=sandboxed,local,worker
--experimental_dynamic_local_load_factor=2
--local_termination_grace_seconds=60
--remote_download_outputs=all
--remote_local_fallback
--spawn_strategy=dynamic
--dynamic_local_strategy=Javac=worker
--dynamic_local_strategy=JavaDeployJar=local
--dynamic_local_strategy=JavaIjar=local
--dynamic_local_strategy=JavaSourceJar=local
--dynamic_local_strategy=KotlinCompile=worker
--dynamic_local_strategy=Turbine=local
--disk_cache=~/.cache/_bazel_disk_cache
--experimental_disk_cache_gc_max_size=500G
--experimental_disk_cache_gc_max_age=0
--experimental_disk_cache_gc_idle_delay=1m
--nobuild_runfile_links
--strategy=TestRunner=remote
--strategy=Genrule=remote
--experimental_dynamic_ignore_local_signals=8,9,10
--bes_upload_mode=FULLY_ASYNC

meisterT commented 3 days ago

Does your build already work with Bazel 8? If so, it would be good to verify whether it still exists with the current RC.

tjgq commented 2 days ago

From bug triage meeting: let's add some more logging so we can better understand what's happening.