Closed sgowroji closed 4 months ago
Looks like this was caused by https://cs.opensource.google/bazel/bazel/+/e9022f6731b4f62d3a08bdc4eacce70ad28e3c78.
Cc @oquenchil
Taking a look.
I'm a bit puzzled about why this happens. I can't reproduce it locally with 100000 runs per test, and I don't see a race condition. Even if the variables that I take from the environment, TEST_WORKSPACE and TEST_SRCDIR, were null, the string placed inside the map wouldn't be null.
The only thing I can think of right now is adding more logging to detect whether we put or get null from the concurrent hash map, and then logging all the environment variables and all the entries in the map.
We'd check that in and, if we see the exception again, use the new info to find out what's going on.
@fmeum wdyt?
The renameTo call isn't guaranteed to be atomic in all filesystem setups, so couldn't (at least in theory) two test actions both enter the isTestAction branch in takeStashedSandboxInternal for the same stashed sandbox? That would explain the issue.
But yes, more logging would certainly help.
I thought it was always atomic within the same filesystem. In which filesystem types isn't it atomic?
That does seem to be true for UnixFileSystem, but it may not hold for JavaIoFileSystem, since File#renameTo isn't guaranteed to be atomic according to its javadocs and is used here: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/vfs/JavaIoFileSystem.java;l=318;drc=9d34f8ab0f1ffb18900feaeb23cb16c93f4e0139
JavaIoFileSystem shouldn't be used on Ubuntu unless there is some other bug, so you are right that this isn't likely to be the cause here.
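For comparison, a minimal sketch of a move that fails loudly instead of silently degrading when atomicity isn't available. It uses java.nio.file.Files.move with StandardCopyOption.ATOMIC_MOVE, which throws AtomicMoveNotSupportedException if the move can't be done atomically. The class and method names (AtomicMoveSketch, tryAtomicMove, demo) are hypothetical and this is not how JavaIoFileSystem is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicMoveSketch {
    // Hypothetical helper: returns true if the move happened atomically,
    // false if the filesystem reported that it cannot move atomically.
    static boolean tryAtomicMove(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (AtomicMoveNotSupportedException e) {
            // Nothing was moved; the caller has to pick another strategy.
            return false;
        }
    }

    // Moves a temporary "sandbox" directory to a sibling "stashed" path and
    // reports whether the outcome on disk matches the helper's return value.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("stash-sketch");
            Path sandbox = Files.createDirectory(root.resolve("sandbox"));
            Path stashed = root.resolve("stashed");
            boolean moved = tryAtomicMove(sandbox, stashed);
            return moved
                ? Files.isDirectory(stashed) && Files.notExists(sandbox)
                : Files.isDirectory(sandbox);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Unlike File#renameTo, which only returns a boolean, this variant distinguishes "atomic move not supported" from other failures.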
I think there is a race; I'm just not sure whether it can arise in practice. If spawn #1 blocks on the ConcurrentHashMap#put call in https://cs.opensource.google/bazel/bazel/+/f05c9d0b8d32d29847d5b16af1e5f8c20d11f66d:src/main/java/com/google/devtools/build/lib/sandbox/SandboxStash.java;l=160, then it will have already moved its sandbox dir into the stash. Spawn #2 could discover and move this directory before the put returns and would then run into an NPE when querying the map.
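The interleaving described above can be sketched deterministically, without threads, by running the steps in the problematic order. Everything here is a hypothetical stand-in rather than the SandboxStash code: stashedDirs models the stash directory on disk, stashMetadata models the concurrent map, and the directory names and values are made up.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class StashRaceSketch {
    // Hypothetical model of the stash directory contents.
    static final Set<String> stashedDirs = ConcurrentHashMap.newKeySet();
    // Hypothetical model of the map that is queried for a stashed sandbox.
    static final Map<String, String> stashMetadata = new ConcurrentHashMap<>();

    // Stashing, step 1: the sandbox directory becomes visible in the stash.
    static void moveIntoStash(String dir) {
        stashedDirs.add(dir);
    }

    // Stashing, step 2: the map entry for that directory is recorded.
    static void recordMetadata(String dir, String value) {
        stashMetadata.put(dir, value);
    }

    // What another spawn does: claim a visible directory, then query the map.
    static String takeStash(String dir) {
        if (!stashedDirs.remove(dir)) {
            throw new IllegalStateException("no stashed sandbox named " + dir);
        }
        return stashMetadata.get(dir); // null if step 2 has not happened yet
    }

    public static void main(String[] args) {
        // Problematic order: spawn #2 runs between step 1 and step 2 of spawn #1.
        moveIntoStash("sandbox-1");
        String seen = takeStash("sandbox-1");
        recordMetadata("sandbox-1", "workspace/target");
        System.out.println(seen); // prints "null"

        // Recording the entry before the directory becomes visible closes the window.
        recordMetadata("sandbox-2", "workspace/target");
        moveIntoStash("sandbox-2");
        System.out.println(takeStash("sandbox-2")); // prints "workspace/target"
    }
}
```

Swapping the two steps is only one possible remedy, shown to make the window visible. It is an assumption of this sketch, not necessarily the change made in Bazel.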
Ah yes, but would we have seen this error? From the error I'd say there was an entry in the map but it was actually null somehow.
Apart from logging the environment variables and the contents of the map, can you think of anything else that would be useful to log?
Map#get also returns null if the key isn't in the map. I think that's more likely than the value actually being null.
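A small sketch of the null behavior discussed here, using only java.util.concurrent.ConcurrentHashMap. The describe helper is a hypothetical stand-in that assumes the map value is built by string concatenation from the two environment variables, which may not match the real code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullLookupSketch {
    // Hypothetical: builds a value by concatenation, as assumed above.
    // Concatenating a null String yields the text "null", not a null reference.
    static String describe(String workspace, String srcdir) {
        return workspace + "/" + srcdir;
    }

    // True if ConcurrentHashMap refuses to store a null value.
    static boolean rejectsNullValue() {
        Map<String, String> map = new ConcurrentHashMap<>();
        try {
            map.put("key", null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new ConcurrentHashMap<>();
        System.out.println(map.get("missing"));   // prints "null": the key is absent
        System.out.println(rejectsNullValue());   // prints "true"
        System.out.println(describe(null, null)); // prints "null/null"
    }
}
```

Since ConcurrentHashMap rejects null keys and values outright, a null result from get on such a map can only mean that the key was absent.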
The result of the readdir of the stashes directory would also be interesting.
Wonderful, thanks!
Downstream CI of bazel-skylib is green now: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3689#018df801-66e8-42eb-8fcf-b047fad7a8cd. We can close this issue.
CI: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3681#018de327-837e-4bb4-8986-ec35f627c4d9
Platform: Ubuntu
Logs:
Steps:
CC Greenteam @meteorcloudy