bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.21k stars 4.06k forks source link

Bazel's own integration tests fail locally on Linux #20753

Open fmeum opened 10 months ago

fmeum commented 10 months ago

Description of the bug:

Many of Bazel's own integration tests fail locally on Linux with an error such as the following:

src/main/tools/linux-sandbox-pid1.cc:280: "The sandbox working directory cannot be below a path where we mount tmpfs (you requested mounting /tmp/bazel-working-directory/_main/_tmp/74701f16efd616fedbad9bb8b36df803/root/0a1489a1de031466aaaf200f31c5b29b/sandbox/linux-sandbox/1/execroot/_main in /tmp). Is your --output_base= below one of your --sandbox_tmpfs_path values?": Invalid argument

This appears to be due to a combination of the outer Bazel process now running with --incompatible_sandbox_hermetic_tmp and the inner Bazel process running with --sandbox_tmpfs_path=/tmp due to https://github.com/bazelbuild/bazel/blob/055e25b75c74d1e2c4c71694a938b6456323facc/src/test/shell/testenv.sh.tmpl#L261

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

bazel test //src/test/shell/bazel:bazel_rules_cc_test --test_output=errors

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

fmeum commented 10 months ago

CC @lberki

lberki commented 10 months ago

Let me see if I can just remove that flag from the test bazelrc. Now that we have hermetic /tmp and a new JVM, it should not be necessary, but ISTR that there were some test breakages when I tried removing it on the side.

lberki commented 10 months ago

Note to self: bazel test --test_output=streamed //src/test/shell/bazel:cc_integration_test --nocache_test_results --test_filter=test_env_inherit_cc_test seems to be an easy way to debug what's going on (so far, I have no clue)

lberki commented 10 months ago

I bet it's the recent reshuffling of /tmp (https://github.com/bazelbuild/bazel/commit/620d617b440258799caa1be434ed66e9ca8fa8c5 or one of the others). What I see is:

linux-sandbox \
  ... \
  -w /tmp/bazel-working-directory/_main/_tmp/b3cdc2524f2b7d337eecbf6bc2c53cb7/root/bca1883186b7bd452a3b6f6da322260c/sandbox/linux-sandbox/4/execroot/_main/_tmp/12935b5e0c672f7d902613a432271f4c \
  ...
  -M /tmp/bazel-working-directory/_main/_tmp/b3cdc2524f2b7d337eecbf6bc2c53cb7/root/bca1883186b7bd452a3b6f6da322260c/sandbox/linux-sandbox/4/_hermetic_tmp -m /tmp   ...
...
src/main/tools/linux-sandbox-pid1.cc:311: writable: /tmp/bazel-working-directory/_main/_tmp/b3cdc2524f2b7d337eecbf6bc2c53cb7/root/bca1883186b7bd452a3b6f6da322260c/sandbox/linux-sandbox/4/execroot/_main/_tmp/12935b5e0c672f7d902613a432271f4c

And my guess is that this is because $TEST_TMPDIR gets its pre-remapping value.

lberki commented 10 months ago

Bingo! There is a sinister comment here:

https://github.com/bazelbuild/bazel/blob/fb4f63fb82f1bdfea572cee4d78c3a956745f3c7/src/main/java/com/google/devtools/build/lib/sandbox/AbstractSandboxSpawnRunner.java#L432

which indicates that 'll probably have to do more digging, sigh.

lberki commented 10 months ago

That comment probably references this code:

https://github.com/bazelbuild/bazel/blob/fb4f63fb82f1bdfea572cee4d78c3a956745f3c7/src/main/java/com/google/devtools/build/lib/sandbox/AbstractContainerizingSandboxedSpawn.java#L121

and apparently that's how it creates $TEST_TMPDIR.

Arguably, $TEST_TMPDIR is set to a completely wrong value because it's set to something that starts with the execroot (outside the sandbox) and the execroot inside the sandbox is something very different. But we also have --test_tmpdir and $TEST_TMPDIR is computed in StandaloneTestStrategy which doesn't know what sandbox (if any) is in use, so can't trivially make $TEST_TMPDIR something hard-wired under /tmp, at least not easily.

Grump squared.

lberki commented 10 months ago

Update: I have a fix for this, but I still see widespread breakages when I remove --sandbox_tmpfs_path from the test bazelrc, mostly in Android.

There are still a few test cases in cc_integration_test so I'll start with those because I would profoundly enjoy not having to debug Android builds.

fmeum commented 10 months ago

The test failure in the coverage integration test looks similar to this as of now unresolved Bazel 7.0.0 regression: https://github.com/bazelbuild/bazel/issues/20556

lberki commented 10 months ago

I had some scraps of time to track down test_execroot_subdir_layout_fails_for_external_subpackages.

One lesson is that the presence of --sandbox_tmpfs_path=/tmp negated the flag flip for --incompatible_sandbox_hermetic_tmp in our test battery, so the functionality was untested the whole time. Go me.

For that test case in particular, one could argue that it's WAI: it tests that builds requiring files under the external/ directory don't work (because they are overridden by external repositories). The reason why the test case fails with hermetic /tmp is that the new layout of the sandbox actually makes it possible to access those files from within the action. Of course, it's a namespace clash (external/foo/bar.cc and @foo//:bar.cc have the same execpath) which is disgusting, but one could argue that this is in fact an improvement.

There are a number of other test cases that I think fail for the same reason and Androd builds are hopelessly broken. haven't been able to figure out how because the little scraps of time I had didn't allow me to put together and Android SDK and NDK good enough for Bazel today..

lberki commented 10 months ago

In addition to test_execroot_subdir_layout_fails_for_external_subpackages I found at least three (no joke) different issues:

  1. $TEST_TMPDIR is not handled appropriately. Its value is set to something under the outside-of-the-sandbox execroot, which is, while available inside the sandbox, is probably wrong. It also fails spectacularly if the output base happens to be under /tmp . This is pretty easy (if moderately hacky) to fix.

  2. The Linux sandbox now relies on the parent tree artifacts of any tree file artifacts in the inputs of the action to be present in the InputMetadataProvider. Turns out, this is not guaranteed and there is one very special case where a TreeFileArtifact in fact goes without its parent tree artifact: action templates. This is also pretty easy to fix and although a call site of SkyframeActionExecutor.requiresTreeMetadataWhenTreeFileIsInput() has a warning about worse performance when it's true, I think it's only a minor performance loss so the fix is actually a one-liner.

  3. Hermetic /tmp does not work if the source tree or an external repository contains symlinks to /tmp. This is almost a tautology, but unfortunately, it is also the case for our integration tests since external repositories that are "bridged" from the outer to the inner Bazel invocation (e.g. the Android SDK) are just like that: the repository is a separate source root in both Bazel invocations, thus, the outer Bazel invocation makes it visible to the test under /tmp/bazel-source-roots/$i and $OUTPUT_BASE/external/$repo contains symlnks to that.

I'm not sure if the right approach here is to special-case /tmp/bazel-source-roots, to flip off the flag in our tests (this time, for a good reason) or to fix it in our integration testing framework.

lberki commented 10 months ago

wrt. Bazel 7.0.1, I propose we do the following:

  1. We cherry-pick https://github.com/bazelbuild/bazel/commit/c48392c54a92b7bc5b04883bd6499c7a5677205d and https://github.com/bazelbuild/bazel/commit/bc1d9d38bea907beea8fd82c362c9882ddf48cfd
  2. poke at our integration test battery to see if there are any more issues that are currently hidden by the fact that we in fact use /tmp in such a way that is specifically contraindicated with --incompatible_sandbox_hermetic_tmp

If (2) doesn't make any more skeletons fall out from the closet, we call the fire drill done. If it does, well, that's unfortunate.

@meteorcloudy WDYT?

iancha1992 commented 10 months ago

@bazel-io fork 7.0.1

iancha1992 commented 10 months ago

@bazel-io fork 7.1.0

lberki commented 10 months ago

Yet Another issue: it looks like deploy jar actions don't quite work. The symptom is:

ERROR: /tmp/bazel-working-directory/_main/_tmp/f6553c75b5e5f9d32addcdb402698239/workspace/java/com/a/BUILD:1:12: Building deploy jar java/com/a/a_deploy.jar failed: (Exit 1): singlejar_local failed: error executing JavaDeployJar command (from target //java/com/a:a_deployjars_internal_rule) external/rules_java~7.3.2~toolchains~remote_java_tools_linux/java_tools/src/tools/singlejar/singlejar_local @bazel-out/k8-fastbuild/bin/java/com/a/a_deploy.jar-0.params

Quick debugging shows that the action input bazel-out/k8-fastbuild/bin/external/bazel_tools/tools/build_defs/build_info/redacted_file.properties is a symlink to /tmp/bazel-working-directory/_main/_tmp/f6553c75b5e5f9d32addcdb402698239/root/install/5c1a48ea43805eea854857f7fc475a0d/embedded_tools/tools/build_defs/build_info/templates/redacted_file.properties.template, which is not available in the sandbox of the inner Bazel instance. I haven't checked yet why. Repro is:

  mkdir -p java/com/a
  cat > java/com/a/BUILD <<'EOF'
java_binary(name="a", srcs=["A.java"])
EOF

  cat > java/com/a/A.java <<'EOF'
package a;

class A {
  public static void main(String[] args) {
    System.out.println("hello");
  }
}
EOF
  bazel build //java/com/a:a_deploy.jar

After adding --incompatible_sandbox_hermetic_tmp to and removing --sandbox_tmpfs_path=/tmp from the test bazelrc.

My guess is that this is an instance of the problem that the integration tests of Bazel do the very thing which --incompatible_sandbox_hermetic_tmp discourages, which is to use files in /tmp in the sandbox (by design, but unfortunate)

lberki commented 10 months ago

Okay, at least this one is not a problem with --incompatible_sandbox_hermetic_tmp and I have a workaround in our test battery and an the Android SDK seems to work with it. On to the NDK!

lberki commented 10 months ago

Update: https://buildkite.com/bazel/google-bazel-presubmit/builds/75806 is the test run that fixes the Android SDK.

It's quite red, but upon closer inspection, it turns out that the only failures are:

  1. various test_execroot_subdir_layout_fails_for_external_subpackages tests which should arguably be removed because it's a good thing that they work with --incompatible_sandbox_hermetic_tmp
  2. test_cc_test_llvm_coverage_produces_lcov_report_with_split_postprocessing which I propose is niche enough to not warrant a cherry-pick (@c-mita do you agree?)

So unless @c-mita or @meteorcloudy feel strongly about (2), I call the fire drill done. Then the next steps are:

  1. cherry-pick the above changes to 7.0.1
  2. Demote this bug to P2 to reflect the decreased urgency
  3. Fix the remaining test cases and flip the flag in our test battery whenever I have time
meteorcloudy commented 10 months ago

I don't have much experience with coverage, I'll leave the decision to you and @c-mita. But overall, this sounds good to me!

lberki commented 10 months ago

I have a fix for the coverage issue. For my own records, it also requires setting the output user root to /tmp/output_user_root in the test framework to work, but that one change is enough to fix the actual bug in Bazel so let's cherry-pick that to 7.0.1, too.

lberki commented 10 months ago

@meteorcloudy the last change to be cherry-picked (https://github.com/bazelbuild/bazel/commit/b0db044227d62178a7e578b8e03c452d8c17af33) is in; AFAIU there are no more issues related to this bug that are caused by an actual bug in Bazel, there are only ones that are caused by incompatibilities in its test battery. I have a change to fix one of those, too (cl/596858687 for the Android SDK) but it's not urgent anymore.

So I'll assume that the cherrypick to 7.0.1 will happen and demote this to P2.

iancha1992 commented 10 months ago

@lberki https://github.com/bazelbuild/bazel/pull/20812 https://github.com/bazelbuild/bazel/pull/20813

Oops. I didn't realize I had to cherry-pick all three. I'll get back to you with new PR's

iancha1992 commented 10 months ago

@lberki Here you go!: https://github.com/bazelbuild/bazel/pull/20821 https://github.com/bazelbuild/bazel/pull/20822

lberki commented 10 months ago

A thousand thanks and a bhattleship :)

fmeum commented 9 months ago

@lberki I just ran bazel test //src/test/shell/bazel:bazel_java_test locally on addf41bce1f26e8bc4bd0b09cb2fbffbb0446f25, which uses Bazel 7.0.1, and still see the failure. Is anything missing?

lberki commented 9 months ago

I know that there are still failures, that's why this bug is still open, but AFAICT they are not due to bugs in Bazel, but due to bugs in the test battery. That particular test is fixed by replacing an ln -s with a cp (otherwise, the final target of the symlink is under /tmp/bazel-source-roots and the inner sandbox overwrites that directory)

daivinhtran commented 9 months ago

Related bug: https://github.com/bazelbuild/bazel/issues/21190. Disabling --incompatible_sandbox_hermetic_tmp in 7.0.2 doesn't help either.