Closed: philsc closed this issue 2 years ago
/cc @Wyverald
Is this caused by manual interruptions during the build?
/cc @larsrc-google
Is this caused by manual interruptions during the build?
I don't see anything in our logs that would indicate a manual interruption during the build.
EDIT: It's certainly possible that this is a fluke (or maybe a bug in our CI system), but I figured it's worth posting in case it isn't. I've never encountered this error before in our CI system.
I'm seeing this also on darwin, definitely no user interruption.
What's the most recent version you're not seeing this issue with?
I submitted a fix for this kind of problem in fe1e25b, and the further changes to src/main/java/com/google/devtools/build/lib/dynamic/DynamicSpawnStrategy.java should not be able to cause this. However, 04754ef can increase the number of actions getting executed without a delay, which could cause this kind of problem to become more common. Before that, 1962a59 made significant changes, but that was picked into 4.2.0.
If you want to help reproduce it, you can increase the chance of hitting this case by setting --experimental_local_execution_delay=0 and reducing the number of jobs to ~#CPUs, which forces a lot of races. If you're able to reproduce this, could you check the commits above?
@jlaxson Are you seeing that on a CI system or with local builds? @philsc I'm slightly surprised that your CI uses dynamic execution. I would have expected purely remote execution.
I'm seeing it locally on 5.0.0rc3, our CI is all remote with --remote_download_minimal. Anecdotally I don't remember it on rc2, but not sure if I was using dynamic execution for much of that time.
But yesterday, I think I hit it about ten times in a row with maybe 1 dirty cpp file and a linking step. I'll dev on 5.0.0-pre.20210708.4 today and see what happens.
@jlaxson Did you reproduce this error with 5.0.0rc2 or earlier versions? @philsc Do you still see this error on your CI?
@philsc Do you still see this error on your CI?
I'm unfortunately on vacation until the new year so I haven't been running this RC as frequently as I normally would, apologies. I won't have any more data points until I get back from vacation.
No problem. Enjoy your vacation!
This is marked as the last release blocker. @philsc were you able to get more data points?
Apologies, I've been working on #14505 to see if there's anything on our side causing it. I'll see if I can still reproduce this error. Though if it's a race condition (for example) I may not see it again.
Since it came out, I ran 5.0.0rc3 a total of 8 times so far on our repo (8 different days) and I encountered the crash 4 times. 2 of those times happened on the same build for different configurations.
In other words, it seems fairly prevalent. At least in our setup. Is there some more debugging information I can try gathering?
Running with --experimental_debug_spawn_scheduler will output some more debugging information. There's also a bunch of information in the log; with that option there will be a lot. If you're able to post it here or mail it to me, I can certainly take a look.
Would you be able to run it with one of the earlier RCs?
Does the error message always say that the local is cancelled and the remote is done, or does it vary? Are there earlier warnings about null results? The only way I can think of this happening is if the remote branch returns null, which it's not supposed to.
These are the 4 crashes. I trimmed other text because it was just regular build progress output.
(15:23:07) INFO: Caught InterruptedException from ExecException for remote branch of simulation/validation/_objs/safety_headway_measurement_validation_params/safety_headway_measurement_validation_params.pic.o, which may cause a crash.
(15:23:08) FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of simulation/validation/_objs/safety_headway_measurement_validation_params/safety_headway_measurement_validation_params.pic.o completed. Local was cancelled and done and remote was not cancelled and done.

(10:40:35) INFO: Caught InterruptedException from ExecException for remote branch of common/logging/python/log_uri_python.so, which may cause a crash.
(10:40:38) FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of common/logging/python/log_uri_python.so completed. Local was cancelled and done and remote was not cancelled and done.

(11:33:00) INFO: Caught InterruptedException from ExecException for remote branch of planning/ml/libtf_scope_join.so, which may cause a crash.
(11:33:09) FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of planning/ml/libtf_scope_join.so completed. Local was cancelled and done and remote was not cancelled and done.

(11:09:44) INFO: Caught InterruptedException from ExecException for remote branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o, which may cause a crash.
(11:10:01) FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
Would you be able to run it with one of the earlier RCs?
I cannot because they immediately trigger analysis errors. RC3 is the first one I was able to run because of #14289 (i.e. fixed in RC3).
I will trigger some runs with --experimental_debug_spawn_scheduler.
I had one run with --experimental_debug_spawn_scheduler that resulted in 3 crashes. Let me check with legal if I'm allowed to post the whole thing or e-mail it to you, @larsrc-google.
I've been on release 5.0.0-pre.20210708.4 for a few weeks and have noticed nothing. I forget why I chose that one...
Those are interesting errors. The message is actually misleading - we caught the InterruptedException from an ExecutionException, not an ExecException. I think this means the branch thread (remote execution) threw InterruptedException in some case other than being cancelled. That would mean the underlying problem is within the remote executor, which @coeuvre knows more about than I do.
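To make that distinction concrete, here's a small standalone java.util.concurrent sketch (an illustration only, not Bazel's actual DynamicSpawnStrategy code): an InterruptedException thrown inside a branch task reaches the code calling Future.get() wrapped in an ExecutionException, and the future ends up done but not cancelled:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Standalone sketch, not Bazel code: a branch task that throws
// InterruptedException on its own, without its future being cancelled.
public class WrappedInterruptDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();

    Callable<Void> remoteBranch = () -> {
      throw new InterruptedException("interrupted inside the branch");
    };

    Future<Void> future = pool.submit(remoteBranch);
    try {
      future.get();
    } catch (ExecutionException e) {
      // The InterruptedException arrives as the *cause* of an
      // ExecutionException; the future is done but not cancelled.
      System.out.println("cause = " + e.getCause());
      System.out.println("isCancelled = " + future.isCancelled()); // false
      System.out.println("isDone = " + future.isDone());           // true
    } catch (InterruptedException e) {
      // This catch would be the "waiter itself was interrupted" case instead.
      System.out.println("waiter itself was interrupted");
    } finally {
      pool.shutdownNow();
    }
  }
}
```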
While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities.
I'm also curious why your CI is using dynamic execution as Chi asked. I would have expected the CI to run all builds remotely.
While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities.
Can you clarify what you mean by outputs? The stack traces are all the same, but let me double-check. EDIT: Yeah, all the stack traces are identical to what I posted in the original report.
I'm also curious why your CI is using dynamic execution as Chi asked. I would have expected the CI to run all builds remotely.
I think @80degreeswest might be a better person to answer that. I haven't been involved with that aspect of our CI a whole lot.
We use --experimental_spawn_scheduler to account for times where small targets seem to run very slow remotely, but execute extremely fast locally (not any specific targets, and not 100% of the time).
@philsc I mean other outputs when run with --experimental_debug_spawn_scheduler. A log, or at least the part of the log leading up to the crash, would be better of course. My two theories are 1) something inside the remote execution system interrupts something and the exception from that escapes and 2) there's some corner case where the dynamic execution system manages to interrupt the remote thread without it being marked as cancelled. I'm leaning strongly towards the former, though.
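As an illustration of the second theory (again a standalone java.util.concurrent sketch, not Bazel code): interrupting the branch's worker thread directly, rather than going through Future.cancel(true), completes the future exceptionally without ever marking it cancelled, which would line up with a branch reported as not cancelled even though it has finished:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Standalone sketch, not Bazel code: interrupting the branch's worker thread
// directly (instead of calling future.cancel(true)) leaves the future "done"
// but never "cancelled".
public class DirectInterruptDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    AtomicReference<Thread> branchThread = new AtomicReference<>();

    Future<?> future = pool.submit(() -> {
      branchThread.set(Thread.currentThread());
      Thread.sleep(60_000); // stand-in for a long-running remote spawn
      return null;
    });

    // Wait until the branch has started, then interrupt its thread directly.
    while (branchThread.get() == null) {
      TimeUnit.MILLISECONDS.sleep(10);
    }
    branchThread.get().interrupt();

    try {
      future.get();
    } catch (ExecutionException e) {
      System.out.println("cause = " + e.getCause());               // InterruptedException
      System.out.println("isCancelled = " + future.isCancelled()); // false: nobody called cancel()
    }
    pool.shutdownNow();
  }
}
```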
Still trying to reproduce this locally. I have a setup that builds Bazel itself with dynamic execution and ran it for 15+ hours, but no luck.
@philsc Can you try with the flag --experimental_remote_grpc_log and share the gRPC log? I am curious about what the remote branch did when the error occurred.
@jlaxson @philsc Are you two working on the same code base?
@philsc gentle ping. This is one of the last release blockers of 5.0 and I'd like to push out a final RC as soon as possible.
@Wyverald, apologies. I'm back from traveling. I'm not trying to hold up the release. I'm happy to continue investigating this if you decide to release 5.0. I need to catch up on e-mail and figure out what legal is letting me do.
EDIT: I submitted an anonymized log for their review. If accepted, I will post it here.
@jlaxson @philsc Are you two working on the same code base?
@larsrc-google, no, we are not.
@philsc Thank you! Given how elusive this issue is and its relatively low prevalence, I talked to a few people and decided to go ahead and release an RC4 without resolving this issue. If we can get this debugged and resolved soon-ish (say within a week from now), then we can cherrypick the fix, otherwise we'll just ship 5.0 and aim for a 5.0.1 point release to eventually fix this.
I now have a Bazel setup where I can do dynamic execution, but I don't have a target I can reproduce this with. If anyone sees this happening with publicly available sources, please let me know. Until then, I shall see what @philsc can provide.
Here's an anonymized build log of one of the crashes: bazel-issue-14433.txt
I had to anonymize it in order to be allowed to post it, apologies. It should still be useful though. If not, let me know.
Next week I will hopefully be able to reproduce with --experimental_remote_grpc_log.
Hm, not much to learn from that, alas. The local branch was cancelled, and then the remote branch got interrupted in a non-cancellation way.
I don't think we can get this debugged quickly. Don't let it hold up the release.
@bazel-io fork 5.1
This is marked as a release blocker for 5.1. Has any progress been made on it?
Apologies. I haven't had the time to collect more data, specifically reproducing the crash with --experimental_remote_grpc_log. I suspect I will get around to it next week.
I'm able to reproduce this as well. I've shared logs produced with --experimental_remote_grpc_log and --experimental_debug_spawn_scheduler with @coeuvre.
I'm having some trouble extracting these artifacts from our CI pipelines. I should be able to get this resolved, create an anonymized version of the artifacts, and upload them here. They appear to be ~300 MiB in size though.
EDIT: I guess the file size is related to how far into the build the crash occurs.
I found one crash that died very early in the run. Here's the anonymized output generated with --experimental_remote_grpc_log:
grpc-events-build.bin.5e2ba1cb-21ee-465b-88f6-7cf5520fc0e2.anonymized.gz
That log corresponds to this crash:
(22:38:46) INFO: Analyzed 69157 targets (0 packages loaded, 0 targets configured).
(22:38:46)
(22:38:46) INFO: Found 69157 targets...
(22:38:47) [0 / 280] [Prepa] Creating source manifest for //:file000000 ... (8 actions, 0 running)
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000001.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000002.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000003.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000004.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000005.pic.o, which may cause a crash.
(22:38:48) FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of file000001.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.waitBranches(DynamicSpawnStrategy.java:345)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.exec(DynamicSpawnStrategy.java:733)
at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:68)
at com.google.devtools.build.lib.rules.cpp.CppCompileAction.beginExecution(CppCompileAction.java:1430)
at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:907)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1076)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1031)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:152)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:91)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:492)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:856)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:169)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:590)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Note: I did not make the filenames match in the text output and the gRPC log. If that's important, let me know and I can rework it.
This looks similar to a thing we've been investigating internally. Our current hypothesis is that something in the remote branch uses interrupts wrongly. It doesn't look like these happen because the remote branch gets interrupted by the dynamic execution system.
There's a general problem that if one of the branches uses interrupts internally, it's impossible to say if an interrupt was from the dynamic execution cancellation or from inside the branch. But that's not what we're seeing here.
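A standalone sketch of that ambiguity (illustration only, not Bazel code): once an interrupt reaches the branch, the branch cannot tell whether it came from Future.cancel(true) or from something inside its own execution:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Standalone sketch, not Bazel code: from inside the branch, an interrupt
// caused by future.cancel(true) and an interrupt issued by the branch's own
// internals look identical.
public class AmbiguousInterruptDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();

    Future<?> branch = pool.submit(() -> {
      try {
        Thread.sleep(60_000); // stand-in for the spawn
      } catch (InterruptedException e) {
        // All the branch sees is "I was interrupted". It cannot tell whether
        // the other branch won (a normal dynamic-execution cancellation) or
        // something inside its own execution used interrupts.
        System.out.println("branch interrupted; reason unknown from here");
      }
    });

    TimeUnit.MILLISECONDS.sleep(100); // let the branch start
    branch.cancel(true);              // one possible source of the interrupt
    pool.shutdown();
  }
}
```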
Found the root cause! Working on the fix.
Description of the problem / feature request:
I'm trying out 5.0.0rc3 in our CI environment and saw the following crash:
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
This is just setting .bazelversion to 5.0.0rc3 and running it on our entire build. This involves a remote buildfarm cluster. I don't know of a "simple" way to reproduce this.

What operating system are you running Bazel on?
Everything's on x86_64 Linux.

What's the output of bazel info release?

Have you found anything relevant by searching the web?
I couldn't find anything pertinent.
Any other information, logs, or outputs that you want to share?
Not at this time.