bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.18k stars 4.06k forks source link

bazel 5.0.0rc3 crashes with local and remote actions being done #14433

Closed philsc closed 2 years ago

philsc commented 2 years ago

Description of the problem / feature request:

I'm trying out 5.0.0rc3 in our CI environment and saw the following crash:

(11:09:44) INFO: Caught InterruptedException from ExecException for remote branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o, which may cause a crash.
--
  | (11:10:01) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
  | at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.waitBranches(DynamicSpawnStrategy.java:345)
  | at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.exec(DynamicSpawnStrategy.java:733)
  | at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
  | at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:68)
  | at com.google.devtools.build.lib.rules.cpp.CppCompileAction.beginExecution(CppCompileAction.java:1430)
  | at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:907)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1076)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1031)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:152)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:91)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:492)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:856)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:169)
  | at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:590)
  | at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
  | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  | at java.base/java.lang.Thread.run(Unknown Source)
  | (11:10:03) Failed with return code 37.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This is just setting .bazelversion to 5.0.0rc3 and running it on our entire build. This involves a remote buildfarm cluster. I don't know of a "simple" way to reproduce this.

What operating system are you running Bazel on?

Everything's on x86_64 Linux.

What's the output of bazel info release?

$ bazel info release
Starting local Bazel server and connecting to it...
release 5.0.0rc3

Have you found anything relevant by searching the web?

I couldn't find anything pertinent.

Any other information, logs, or outputs that you want to share?

Not at this time.

philsc commented 2 years ago

/cc @Wyverald

coeuvre commented 2 years ago

Is this caused by manual interruptions during the build?

coeuvre commented 2 years ago

/cc @larsrc-google

philsc commented 2 years ago

Is this caused by manual interruptions during the build?

I don't see anything in our logs that would indicate a manual interruption during the build.

EDIT: It's certainly possible that this is a fluke (or maybe a bug in our CI system), but I figured it's worth posting in case it isn't. I've never encountered this error before in our CI system.

jlaxson commented 2 years ago

I'm seeing this also on darwin, definitely no user interruption.

larsrc-google commented 2 years ago

What's the most recent version you're not seeing this issue with?

larsrc-google commented 2 years ago

I submitted a fix for this kind of problem in fe1e25b, and the further changes to src/main/java/com/google/devtools/build/lib/dynamic/DynamicSpawnStrategy.java should not be able to cause this. However, 04754ef can increase the number of actions getting executed without a delay, which could cause this kind of problem to become more common. Before that, 1962a59 did a significant changes, but that was picked into 4.2.0.

If you want to help reproduce it, you can increase the chance of hitting this case by setting --experimental_local_execution_delay=0 and reducing the number of jobs to ~#CPUs, which forces a lot of races. If you're able to reproduce this, could you check the commits above?

@jlaxson Are you seeing that on a CI system or with local builds? @philsc I'm slightly surprised that your CI uses dynamic execution. I would have expected purely remote execution.

jlaxson commented 2 years ago

I'm seeing it locally on 5.0.0rc3, our CI is all remote with --remote_download_minimal. Anecdotally I don't remember it on rc2, but not sure if I was using dynamic execution for much of that time.

But yesterday, I think I hit it about ten times in a row with maybe 1 dirty cpp file and a linking step. I'll dev on 5.0.0-pre.20210708.4 today and see what happens.

coeuvre commented 2 years ago

@jlaxson Did you reproduce this error with 5.0.0rc2 or earlier versions? @philsc Do you still see this error on your CI?

philsc commented 2 years ago

@philsc Do you still see this error on your CI?

I'm unfortunately on vacation until the new year so I haven't been running this RC as frequently as I normally would, apologies. I won't have any more data points until I get back from vacation.

coeuvre commented 2 years ago

No problem. Enjoy your vacation!

brentleyjones commented 2 years ago

This is marked as the last release blocker. @philsc were you able to get more data points?

philsc commented 2 years ago

Apologies, I've been working on #14505 to see if there's anything on our side causing it. I'll see if I can still reproduce this error. Though if it's a race condition (for example) I may not see it again.

philsc commented 2 years ago

Since it came out, I ran 5.0.0rc3 a total 8 times so far on our repo (8 different days) and I encountered the crash 4 times. 2 of those times happened on the same build for different configurations.

In other words, it seems fairly prevalent. At least in our setup. Is there some more debugging information I can try gathering?

larsrc-google commented 2 years ago

Running with --experimental_debug_spawn_scheduler will output some more debugging information. There's also a bunch of information in the log, with that option there will be a lot. If you're able to post it here or mail it to me, I can certainly take a look.

Would you be able to run it with one of the earlier RCs?

Does the error message always say that the local is cancelled and the remote is done, or does it vary? Are there earlier warnings about null results? The only way I can think of this happening is if the remote branch returns null, which it's not supposed to.

philsc commented 2 years ago

These are the 4 crashes. I trimmed other text because it was just regular build progress output.

(15:23:07) INFO: Caught InterruptedException from ExecException for remote branch of simulation/validation/_objs/safety_headway_measurement_validation_params/safety_headway_measurement_validation_params.pic.o, which may cause a crash.
--
  | (15:23:08) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of simulation/validation/_objs/safety_headway_measurement_validation_params/safety_headway_measurement_validation_params.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
(10:40:35) INFO: Caught InterruptedException from ExecException for remote branch of common/logging/python/log_uri_python.so, which may cause a crash.
--
  | (10:40:38) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of common/logging/python/log_uri_python.so completed. Local was cancelled and done and remote was not cancelled and done.
(11:33:00) INFO: Caught InterruptedException from ExecException for remote branch of planning/ml/libtf_scope_join.so, which may cause a crash.
--
  | (11:33:09) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of planning/ml/libtf_scope_join.so completed. Local was cancelled and done and remote was not cancelled and done.
(11:09:44) INFO: Caught InterruptedException from ExecException for remote branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o, which may cause a crash.
--
  | (11:10:01) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o completed. Local was cancelled and done and remote was not cancelled and done.

Would you be able to run it with one of the earlier RCs?

I cannot because they immediately trigger analysis errors. RC3 is the first one I was able to run because of #14289 (i.e. fixed in RC3).

I will trigger some runs with --experimental_debug_spawn_scheduler.

philsc commented 2 years ago

I had one run with --experimental_debug_spawn_scheduler that resulted in 3 crashes. Let me check with legal if I'm allowed to post the whole thing or e-mail it to you, @larsrc-google.

jlaxson commented 2 years ago

I've been on release 5.0.0-pre.20210708.4 for a few weeks and have noticed nothing. Forget why I chose that one...

larsrc-google commented 2 years ago

Those are interesting errors. The message is actually misleading - we caught the InterruptedException from an ExecutionExecption, not an ExecException. I think this means the branch thread (remote execution) threw InterruptedException in some other case than being cancelled. That would make the underlying problem be within the remote executor, which @coeuvre knows more about that me.

While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities.

I'm also curious why your CI is using dynamic execution as Chi asked. I would have expected the CI to run all builds remotely.

philsc commented 2 years ago

While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities.

Can you clarify what you mean by outputs? The stack traces are all the same, but let me double-check. EDIT: Yeah, all the stack traces are identical to what I posted in the original report.

I'm also curious why your CI is using dynamic execution as Chi asked. I would have expected the CI to run all builds remotely.

I think @80degreeswest might be a better person to answer that. I haven't been involved with that aspect of our CI a whole lot.

80degreeswest commented 2 years ago

We use --experimental_spawn_scheduler to account for times where small targets seem to run very slow remotely, but execute extremely fast locally (not any specific targets, and not 100% of the time).

larsrc-google commented 2 years ago

@philsc I mean other outputs when run with --experimental_debug_spawn_scheduler. A log, or at least the part of the log leading up to the crash, would be better of course. My two theories are 1) something inside the remote execution system interrupts something and the exception from that escapes and 2) there's some corner case where the dynamic execution system manages to interrupt the remote thread without it being marked as cancelled. I'm leaning strongly towards the former, though.

coeuvre commented 2 years ago

Still trying to reproduce this locally. I have a setup that builds Bazel itself with dynamic execution and ran it for 15+ hours but no lucky.

coeuvre commented 2 years ago

@philsc Can you try with flag --experimental_remote_grpc_log and share the gRPC log? I am curious about what the remote branch did when the error occurs.

larsrc-google commented 2 years ago

@jlaxson @philsc Are you two working on the same code base?

Wyverald commented 2 years ago

@philsc gentle ping. This is one of the last release blockers of 5.0 and I'd like to push out a final RC as soon as possible.

philsc commented 2 years ago

@Wyverald , apologies. I'm back from traveling. I'm not trying to hold up the release. I'm happy to continue investigating this if you decide to release 5.0. I need to catch up on e-mail and figure out what legal is letting me do.

EDIT: I submitted an anonymized log for their review. If accepted, I will post it here.

philsc commented 2 years ago

@jlaxson @philsc Are you two working on the same code base?

@larsrc-google , no we are not.

Wyverald commented 2 years ago

@philsc Thank you! Given how elusive this issue is and its relatively low prevalence, I talked to a few people and decided to go ahead and release an RC4 without resolving this issue. If we can get this debugged and resolved soon-ish (say within a week from now), then we can cherrypick the fix, otherwise we'll just ship 5.0 and aim for a 5.0.1 point release to eventually fix this.

larsrc-google commented 2 years ago

I now have a Bazel setup where I can do dynamic execution, but I don't have a target I can reproduce this with. If anyone sees this happening with a publicly available srcs, please let me know. Until then, I shall see what philsc can provide.

philsc commented 2 years ago

Here's an anonymized build log of one of the crashes: bazel-issue-14433.txt I had to anonymize it in order to be allowed to post it, apologies. It should still be useful though. If not, let me know. Next week I will hopefully be able to reproduce with --experimental_remote_grpc_log

larsrc-google commented 2 years ago

Hm, not much to learn from that, alas. The local branch was cancelled, and then the remote branch got interrupted in a non-cancellation way.

I don't think we can get this debugged quickly. Don't let it hold up the release.

Wyverald commented 2 years ago

@bazel-io fork 5.1

brentleyjones commented 2 years ago

This is marked as a release blocker for 5.1. Has any progress been made on it?

philsc commented 2 years ago

Apologies. I haven't had the time to collect more data. Specifically, reproducing the crash with --experimental_remote_grpc_log. I suspect I will get around to it next week.

tomrenn commented 2 years ago

I'm able to reproduce this as well. I've shared logs using --experimental_remote_grpc_log and --experimental_debug_spawn_scheduler to @coeuvre.

philsc commented 2 years ago

I'm having some trouble extracting these artifacts from our CI pipelines. I should be able to get this resolved, create an anonymous version of the artifacts and upload them here. They appear to be ~300 MiB in size though.

EDIT: I guess the file size is related to how far into the build the crash occurs.

philsc commented 2 years ago

I found one crash that died very early in the run. Here's the anonymized output generated with --experimental_remote_grpc_log: grpc-events-build.bin.5e2ba1cb-21ee-465b-88f6-7cf5520fc0e2.anonymized.gz

That log corresponds to this crash:

(22:38:46) INFO: Analyzed 69157 targets (0 packages loaded, 0 targets configured).
(22:38:46)
(22:38:46) INFO: Found 69157 targets...
(22:38:47) [0 / 280] [Prepa] Creating source manifest for //:file000000 ... (8 actions, 0 running)
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000001.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000002.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000003.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000004.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000005.pic.o, which may cause a crash.
(22:38:48) FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of file000001.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
        at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.waitBranches(DynamicSpawnStrategy.java:345)
        at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.exec(DynamicSpawnStrategy.java:733)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:68)
        at com.google.devtools.build.lib.rules.cpp.CppCompileAction.beginExecution(CppCompileAction.java:1430)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:907)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1076)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1031)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:152)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:91)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:492)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:856)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:169)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:590)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

Note: I did not make make the filenames match in the text output and the GRPC log. If that's important, let me know and I can rework it.

larsrc-google commented 2 years ago

This looks similar to a thing we've been investigating internally. Our current hypothesis is that something in the remote branch uses interrupts wrongly. It doesn't look like these happen because the remote branch gets interrupted by the dynamic execution system.

There's a general problem that if one of the branches uses interrupts internally, it's impossible to say if an interrupt was from the dynamic execution cancellation or from inside the branch. But that's not what we're seeing here.

coeuvre commented 2 years ago

Found the root cause! Working on the fix.