Open fmeum opened 10 months ago
CC @lberki
Let me see if I can just remove that flag from the test bazelrc. Now that we have hermetic /tmp
and a new JVM, it should not be necessary, but ISTR that there were some test breakages when I tried removing it on the side.
Note to self: bazel test --test_output=streamed //src/test/shell/bazel:cc_integration_test --nocache_test_results --test_filter=test_env_inherit_cc_test
seems to be an easy way to debug what's going on (so far, I have no clue)
I bet it's the recent reshuffling of /tmp
(https://github.com/bazelbuild/bazel/commit/620d617b440258799caa1be434ed66e9ca8fa8c5 or one of the others). What I see is:
linux-sandbox \
... \
-w /tmp/bazel-working-directory/_main/_tmp/b3cdc2524f2b7d337eecbf6bc2c53cb7/root/bca1883186b7bd452a3b6f6da322260c/sandbox/linux-sandbox/4/execroot/_main/_tmp/12935b5e0c672f7d902613a432271f4c \
...
-M /tmp/bazel-working-directory/_main/_tmp/b3cdc2524f2b7d337eecbf6bc2c53cb7/root/bca1883186b7bd452a3b6f6da322260c/sandbox/linux-sandbox/4/_hermetic_tmp -m /tmp ...
...
src/main/tools/linux-sandbox-pid1.cc:311: writable: /tmp/bazel-working-directory/_main/_tmp/b3cdc2524f2b7d337eecbf6bc2c53cb7/root/bca1883186b7bd452a3b6f6da322260c/sandbox/linux-sandbox/4/execroot/_main/_tmp/12935b5e0c672f7d902613a432271f4c
And my guess is that this is because $TEST_TMPDIR
gets its pre-remapping value.
Bingo! There is a sinister comment here:
which indicates that 'll probably have to do more digging, sigh.
That comment probably references this code:
and apparently that's how it creates $TEST_TMPDIR
.
Arguably, $TEST_TMPDIR
is set to a completely wrong value because it's set to something that starts with the execroot (outside the sandbox) and the execroot inside the sandbox is something very different. But we also have --test_tmpdir
and $TEST_TMPDIR
is computed in StandaloneTestStrategy
which doesn't know what sandbox (if any) is in use, so can't trivially make $TEST_TMPDIR
something hard-wired under /tmp
, at least not easily.
Grump squared.
Update: I have a fix for this, but I still see widespread breakages when I remove --sandbox_tmpfs_path
from the test bazelrc, mostly in Android.
There are still a few test cases in cc_integration_test
so I'll start with those because I would profoundly enjoy not having to debug Android builds.
The test failure in the coverage integration test looks similar to this as of now unresolved Bazel 7.0.0 regression: https://github.com/bazelbuild/bazel/issues/20556
I had some scraps of time to track down test_execroot_subdir_layout_fails_for_external_subpackages
.
One lesson is that the presence of --sandbox_tmpfs_path=/tmp
negated the flag flip for --incompatible_sandbox_hermetic_tmp
in our test battery, so the functionality was untested the whole time. Go me.
For that test case in particular, one could argue that it's WAI: it tests that builds requiring files under the external/
directory don't work (because they are overridden by external repositories). The reason why the test case fails with hermetic /tmp
is that the new layout of the sandbox actually makes it possible to access those files from within the action. Of course, it's a namespace clash (external/foo/bar.cc
and @foo//:bar.cc
have the same execpath) which is disgusting, but one could argue that this is in fact an improvement.
There are a number of other test cases that I think fail for the same reason and Androd builds are hopelessly broken. haven't been able to figure out how because the little scraps of time I had didn't allow me to put together and Android SDK and NDK good enough for Bazel today..
In addition to test_execroot_subdir_layout_fails_for_external_subpackages
I found at least three (no joke) different issues:
$TEST_TMPDIR
is not handled appropriately. Its value is set to something under the outside-of-the-sandbox execroot, which is, while available inside the sandbox, is probably wrong. It also fails spectacularly if the output base happens to be under /tmp
. This is pretty easy (if moderately hacky) to fix.
The Linux sandbox now relies on the parent tree artifacts of any tree file artifacts in the inputs of the action to be present in the InputMetadataProvider
. Turns out, this is not guaranteed and there is one very special case where a TreeFileArtifact
in fact goes without its parent tree artifact: action templates. This is also pretty easy to fix and although a call site of SkyframeActionExecutor.requiresTreeMetadataWhenTreeFileIsInput()
has a warning about worse performance when it's true, I think it's only a minor performance loss so the fix is actually a one-liner.
Hermetic /tmp
does not work if the source tree or an external repository contains symlinks to /tmp
. This is almost a tautology, but unfortunately, it is also the case for our integration tests since external repositories that are "bridged" from the outer to the inner Bazel invocation (e.g. the Android SDK) are just like that: the repository is a separate source root in both Bazel invocations, thus, the outer Bazel invocation makes it visible to the test under /tmp/bazel-source-roots/$i
and $OUTPUT_BASE/external/$repo
contains symlnks to that.
I'm not sure if the right approach here is to special-case /tmp/bazel-source-roots
, to flip off the flag in our tests (this time, for a good reason) or to fix it in our integration testing framework.
wrt. Bazel 7.0.1, I propose we do the following:
/tmp
in such a way that is specifically contraindicated with --incompatible_sandbox_hermetic_tmp
If (2) doesn't make any more skeletons fall out from the closet, we call the fire drill done. If it does, well, that's unfortunate.
@meteorcloudy WDYT?
@bazel-io fork 7.0.1
@bazel-io fork 7.1.0
Yet Another issue: it looks like deploy jar actions don't quite work. The symptom is:
ERROR: /tmp/bazel-working-directory/_main/_tmp/f6553c75b5e5f9d32addcdb402698239/workspace/java/com/a/BUILD:1:12: Building deploy jar java/com/a/a_deploy.jar failed: (Exit 1): singlejar_local failed: error executing JavaDeployJar command (from target //java/com/a:a_deployjars_internal_rule) external/rules_java~7.3.2~toolchains~remote_java_tools_linux/java_tools/src/tools/singlejar/singlejar_local @bazel-out/k8-fastbuild/bin/java/com/a/a_deploy.jar-0.params
Quick debugging shows that the action input bazel-out/k8-fastbuild/bin/external/bazel_tools/tools/build_defs/build_info/redacted_file.properties
is a symlink to /tmp/bazel-working-directory/_main/_tmp/f6553c75b5e5f9d32addcdb402698239/root/install/5c1a48ea43805eea854857f7fc475a0d/embedded_tools/tools/build_defs/build_info/templates/redacted_file.properties.template
, which is not available in the sandbox of the inner Bazel instance. I haven't checked yet why. Repro is:
mkdir -p java/com/a
cat > java/com/a/BUILD <<'EOF'
java_binary(name="a", srcs=["A.java"])
EOF
cat > java/com/a/A.java <<'EOF'
package a;
class A {
public static void main(String[] args) {
System.out.println("hello");
}
}
EOF
bazel build //java/com/a:a_deploy.jar
After adding --incompatible_sandbox_hermetic_tmp
to and removing --sandbox_tmpfs_path=/tmp
from the test bazelrc.
My guess is that this is an instance of the problem that the integration tests of Bazel do the very thing which --incompatible_sandbox_hermetic_tmp
discourages, which is to use files in /tmp
in the sandbox (by design, but unfortunate)
Okay, at least this one is not a problem with --incompatible_sandbox_hermetic_tmp
and I have a workaround in our test battery and an the Android SDK seems to work with it. On to the NDK!
Update: https://buildkite.com/bazel/google-bazel-presubmit/builds/75806 is the test run that fixes the Android SDK.
It's quite red, but upon closer inspection, it turns out that the only failures are:
test_execroot_subdir_layout_fails_for_external_subpackages
tests which should arguably be removed because it's a good thing that they work with --incompatible_sandbox_hermetic_tmp
test_cc_test_llvm_coverage_produces_lcov_report_with_split_postprocessing
which I propose is niche enough to not warrant a cherry-pick (@c-mita do you agree?)So unless @c-mita or @meteorcloudy feel strongly about (2), I call the fire drill done. Then the next steps are:
I don't have much experience with coverage, I'll leave the decision to you and @c-mita. But overall, this sounds good to me!
I have a fix for the coverage issue. For my own records, it also requires setting the output user root to /tmp/output_user_root
in the test framework to work, but that one change is enough to fix the actual bug in Bazel so let's cherry-pick that to 7.0.1, too.
@meteorcloudy the last change to be cherry-picked (https://github.com/bazelbuild/bazel/commit/b0db044227d62178a7e578b8e03c452d8c17af33) is in; AFAIU there are no more issues related to this bug that are caused by an actual bug in Bazel, there are only ones that are caused by incompatibilities in its test battery. I have a change to fix one of those, too (cl/596858687 for the Android SDK) but it's not urgent anymore.
So I'll assume that the cherrypick to 7.0.1 will happen and demote this to P2.
@lberki https://github.com/bazelbuild/bazel/pull/20812 https://github.com/bazelbuild/bazel/pull/20813
Oops. I didn't realize I had to cherry-pick all three. I'll get back to you with new PR's
@lberki Here you go!: https://github.com/bazelbuild/bazel/pull/20821 https://github.com/bazelbuild/bazel/pull/20822
A thousand thanks and a bhattleship :)
@lberki I just ran bazel test //src/test/shell/bazel:bazel_java_test
locally on addf41bce1f26e8bc4bd0b09cb2fbffbb0446f25, which uses Bazel 7.0.1, and still see the failure. Is anything missing?
I know that there are still failures, that's why this bug is still open, but AFAICT they are not due to bugs in Bazel, but due to bugs in the test battery. That particular test is fixed by replacing an ln -s
with a cp
(otherwise, the final target of the symlink is under /tmp/bazel-source-roots
and the inner sandbox overwrites that directory)
Related bug: https://github.com/bazelbuild/bazel/issues/21190. Disabling --incompatible_sandbox_hermetic_tmp
in 7.0.2 doesn't help either.
Description of the bug:
Many of Bazel's own integration tests fail locally on Linux with an error such as the following:
This appears to be due to a combination of the outer Bazel process now running with
--incompatible_sandbox_hermetic_tmp
and the inner Bazel process running with--sandbox_tmpfs_path=/tmp
due to https://github.com/bazelbuild/bazel/blob/055e25b75c74d1e2c4c71694a938b6456323facc/src/test/shell/testenv.sh.tmpl#L261Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Which operating system are you running Bazel on?
Linux
What is the output of
bazel info release
?No response
If
bazel info release
returnsdevelopment version
or(@non-git)
, tell us how you built Bazel.No response
What's the output of
git remote get-url origin; git rev-parse master; git rev-parse HEAD
?No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response