bazelbuild / bazel-skylib

Common useful functions and rules for Bazel
https://bazel.build/
Apache License 2.0
376 stars 179 forks source link

NullPointerException with bazel@HEAD in CI #488

Closed sgowroji closed 4 months ago

sgowroji commented 5 months ago

CI: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3681#018de327-837e-4bb4-8986-ec35f627c4d9

Platform: Ubuntu

Logs:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'UnshareableActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//tests:subpackages_all_w_exclude, config=BuildConfigurationKey[86b97794d575f038822a36ec1392fdddcdcaebfcd711a3ac8088056fc5bb38f5]}, actionIndex=5}' (requested by nodes 'TestCompletionKey{configuredTargetKey=ConfiguredTargetKey{label=//tests:subpackages_all_w_exclude, config=BuildConfigurationKey[86b97794d575f038822a36ec1392fdddcdcaebfcd711a3ac8088056fc5bb38f5]}, topLevelArtifactContext=com.google.devtools.build.lib.analysis.TopLevelArtifactContext@90904c3b, exclusiveTesting=false}')
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:551)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
    at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.lang.NullPointerException
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
    at com.google.devtools.build.lib.vfs.Path.getRelative(Path.java:121)
    at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandboxInternal(SandboxStash.java:99)
    at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandbox(SandboxStash.java:73)
    at com.google.devtools.build.lib.sandbox.SymlinkedSandboxedSpawn.filterInputsAndDirsToCreate(SymlinkedSandboxedSpawn.java:78)
    at com.google.devtools.build.lib.sandbox.AbstractContainerizingSandboxedSpawn.createFileSystem(AbstractContainerizingSandboxedSpawn.java:135)
    at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:146)
    at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:113)
    at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:456)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:160)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
    at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
    at com.google.devtools.build.lib.exec.StandaloneTestStrategy.runTestAttempt(StandaloneTestStrategy.java:652)
    at com.google.devtools.build.lib.exec.StandaloneTestStrategy.beginTestAttempt(StandaloneTestStrategy.java:315)
    at com.google.devtools.build.lib.exec.StandaloneTestStrategy$StandaloneTestRunnerSpawn.execute(StandaloneTestStrategy.java:577)
    at com.google.devtools.build.lib.analysis.test.TestRunnerAction.executeAllAttempts(TestRunnerAction.java:1162)
    at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:974)
    at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:951)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1140)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1053)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:547)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:903)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:191)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:462)
    ... 7 more

Steps:

git clone https://github.com/bazelbuild/bazel-skylib
git reset e60cf0001fdc1500cf5ba2a6bf4ca0b46a1f725f  --hard
export USE_BAZEL_VERSION=9d34f8ab0f1ffb18900feaeb23cb16c93f4e0139
bazel test  //... @bazel_skylib_gazelle_plugin//... 

CC Greenteam @meteorcloudy

fmeum commented 5 months ago

Looks like this was caused by https://cs.opensource.google/bazel/bazel/+/e9022f6731b4f62d3a08bdc4eacce70ad28e3c78.

Cc @oquenchil

oquenchil commented 5 months ago

Taking a look.

oquenchil commented 5 months ago

I'm a bit puzzled about why this happens. I can't reproduce locally with 100000 runs per test, I don't see a race condition and even if the variables that I take from the environment TEST_WORKSPACE and TEST_SRCDIR were null, the string placed inside the map wouldn't be null.

The only thing I can think of right now is adding more logging to detect if we put/get null from the concurrent hash map, then log all the environment variables in the environment and all the entries in the map.

We'd check that in and if we see the exception again use the new info to find out what's going on.

@fmeum wdyt?

fmeum commented 5 months ago

The renameTo call isn't guaranteed to be atomic in all filesystem setups, so couldn't (at least in theory) two test actions both enter the isTestAction branch in takeStashedSandboxInternal for the same stashed sandbox? That would explain the issue.

But yes, more logging would certainly help.

oquenchil commented 5 months ago

I thought it was always atomic within the same filesystem. In which filesystem types isn't it atomic?

fmeum commented 5 months ago

I thought it was always atomic within the same filesystem. In which filesystem types isn't it atomic?

That does seem to be true for the UnixFileSystem, but the JavaIoFileSystem may not be as File#renameTo isn't guaranteed to be atomic according to its javadocs and is used here: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/vfs/JavaIoFileSystem.java;l=318;drc=9d34f8ab0f1ffb18900feaeb23cb16c93f4e0139

It shouldn't be used on Ubuntu unless there is some other bug, so you are right that this isn't likely to be the cause here.

fmeum commented 5 months ago

I think that there is a race, I'm just not sure whether it can arise in practice: If spawn #1 blocks on the ConcurrentHashMap#put call in https://cs.opensource.google/bazel/bazel/+/f05c9d0b8d32d29847d5b16af1e5f8c20d11f66d:src/main/java/com/google/devtools/build/lib/sandbox/SandboxStash.java;l=160, then it will have already moved its sandbox dir into the stash. Spawn #2 could discover and move this directory before the put returns and would then run into an NPE when querying the map.

oquenchil commented 5 months ago

Ah yes, but would we have seen this error? From the error I'd say there was an entry in the map but it was actually null somehow.

Apart from logging the environment variables and the contents of the map, can you think of anything else that would be useful to log?

fmeum commented 5 months ago

Ah yes, but would we have seen this error? From the error I'd say there was an entry in the map but it was actually null somehow.

Map#get also returns null if the key isn't in the map. I think that's more likely than the value actually being null.

Apart from logging the environment variables and the contents of the map, can you think of anything else that would be useful to log?

The result of the readdir of the stashes directory would also be interesting.

oquenchil commented 5 months ago

Wonderful, thanks!

sgowroji commented 5 months ago

Downstream CI of bazel-skylib is Green now https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3689#018df801-66e8-42eb-8fcf-b047fad7a8cd. We can close this issue.