bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

"No such file or directory" when upgrading from v6.5.0 to 7.x.x #23743

Open swarren12 opened 4 days ago

swarren12 commented 4 days ago

Description of the bug:

I'm trying to update a fairly complicated Bazel project from Bazel v6.5.0 to v7.x.x, but I'm encountering intermittent build failures. Unfortunately, I can't pinpoint exactly where the issue lies, but I believe it is in Bazel itself rather than in any of the rules being imported.

Expected behaviour: upgrading from v6.5.0 to v7.x.x "just works"

Actual behaviour: the build fails due to files inside the linux-sandbox not being found

More details:

Currently, on Bazel v6.5.0, the build reliably passes both on local development workstations and in the CI environment. Upgrading to v7.x.x causes the build to fail occasionally on development machines and much more consistently in CI. Unfortunately, I've been unable to reproduce it in an isolated example project, and I'm not sure exactly how to go about collecting more information on the problem.

I've tried upgrading to v7.0.0, v7.1.2, v7.2.1 and v7.3.1 but they all behave the same way.

It's not always the same target that fails, but it's always roughly for the same reason, which is that a file is not found within the sandbox.

One example of this is shown below. A bazel clean --expunge was run first, and then (an equivalent of) bazel test //... --test_tag_filters=smoke, which first failed when trying to create an ijar for a java_import for a file checked into version control:

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
src/main/tools/linux-sandbox-pid1.cc:530: "execvp(external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar, 0x1d1014a0)": No such file or directory
ERROR: lib/BUILD:2782:13: Extracting interface for jar lib/3rd-party/io.netty/netty-codec-haproxy/netty-codec-haproxy-4.1.113.Final.jar failed: (Exit 1): ijar failed: error executing JavaIjar command (from target //lib:netty-codec-haproxy) 
  (cd /home/warrens/.cache/bazel/_bazel_warrens/cf05af78ffeddb63393e16c80fd92083/sandbox/linux-sandbox/2/execroot/_main && \
  exec env - \
    PATH=/bin:/usr/bin:/usr/local/bin \
  external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar lib/3rd-party/io.netty/netty-codec-haproxy/netty-codec-haproxy-4.1.113.Final.jar bazel-out/k8-fastbuild/bin/lib/_ijar/netty-codec-haproxy/lib/3rd-party/io.netty/netty-codec-haproxy/netty-codec-haproxy-4.1.113.Final-ijar.jar --target_label //lib:netty-codec-haproxy)

Running the same bazel test command a second time also resulted in a failure, this time failing to run java:

ERROR: [snip]/BUILD:31:14: Building [snip]/SomeJavaTest.jar () failed: IOException while preparing the execution environment of a worker:
...
---8<---8<--- Exception details ---8<---8<---
java.io.IOException: Cannot run program "/home/warrens/.cache/bazel/_bazel_warrens/cf05af78ffeddb63393e16c80fd92083/execroot/_main/external/_main~java_repositories~jdk11/bin/java" (in directory "/home/warrens/.cache/bazel/_bazel_warrens/cf05af78ffeddb63393e16c80fd92083/execroot/_main"): error=2, No such file or directory
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
        at com.google.devtools.build.lib.shell.JavaSubprocessFactory.start(JavaSubprocessFactory.java:152)
...
Caused by: java.io.IOException: error=2, No such file or directory
        at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
        at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:295)
        at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:225)
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1126)

A third run of bazel test completed successfully.

A separate example, taken from CI, exhibits a similar failure mode, this time while running some tests:

ERROR: [snip]/BUILD:11:10: Testing //...:some-custom-test-rule failed: (Exit 1): generate-xml.sh failed: error executing TestRunner command (from target //...:some-custom-test-rule) 
  (cd /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/34/execroot/_main && \
  exec env - \
    EXPERIMENTAL_SPLIT_XML_GENERATION=1 \
    JAVA_RUNFILES=bazel-out/k8-fastbuild/bin/.../some-custom-test-rule.runfiles \
    PATH=/bin:/usr/bin:/usr/local/bin \
    PYTHON_RUNFILES=bazel-out/k8-fastbuild/bin/.../some-custom-test-rule.runfiles \
    RUNFILES_DIR=bazel-out/k8-fastbuild/bin/.../some-custom-test-rule.runfiles \
    RUN_UNDER_RUNFILES=1 \
    TEST_BINARY=.../some-custom-test-rule \
    TEST_INFRASTRUCTURE_FAILURE_FILE=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.infrastructure_failure \
    TEST_LOGSPLITTER_OUTPUT_FILE=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.raw_splitlogs/test.splitlogs \
    TEST_NAME=//...:some-custom-test-rule \
    TEST_PREMATURE_EXIT_FILE=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.exited_prematurely \
    TEST_SHARD_INDEX=0 \
    TEST_SIZE=small \
    TEST_SRCDIR=bazel-out/k8-fastbuild/bin/.../some-custom-test-rule.runfiles \
    TEST_TARGET=//...:some-custom-test-rule \
    TEST_TIMEOUT=60 \
    TEST_TMPDIR=_tmp/ff60cd74048852c7bacd3c1d1b00a8f2 \
    TEST_TOTAL_SHARDS=0 \
    TEST_UNDECLARED_OUTPUTS_ANNOTATIONS=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.outputs_manifest/ANNOTATIONS \
    TEST_UNDECLARED_OUTPUTS_ANNOTATIONS_DIR=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.outputs_manifest \
    TEST_UNDECLARED_OUTPUTS_DIR=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.outputs \
    TEST_UNDECLARED_OUTPUTS_MANIFEST=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.outputs_manifest/MANIFEST \
    TEST_UNDECLARED_OUTPUTS_ZIP=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.outputs/outputs.zip \
    TEST_UNUSED_RUNFILES_LOG_FILE=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.unused_runfiles_log \
    TEST_WARNINGS_OUTPUT_FILE=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.warnings \
    TEST_WORKSPACE=_main \
    TZ=UTC \
    XML_OUTPUT_FILE=bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.xml \
  external/bazel_tools/tools/test/generate-xml.sh bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.log bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.xml 0 1)
# Configuration: 96d1f52e073df1fb1edb92e576742c56c7c33cdfdf7dc366cbda968896be461f
# Execution platform: @@platforms//host:host

At first glance, this looked like a different type of failure; however, running cat on the test.log shows:

$ cat bazel-out/k8-fastbuild/testlogs/.../some-custom-test-rule/test.log
src/main/tools/linux-sandbox-pid1.cc:530: "execvp(external/bazel_tools/tools/test/test-setup.sh, 0xbd0c10)": No such file or directory

Some observations:

Currently, I'm leaning towards this being a problem with multiple processes trying to interact with the sandbox at the same time; this would explain why I'm unable to reproduce it on a small project and why it fails more consistently in CI (bigger box with more cores to run tasks in parallel). However, if I add --jobs=1 the problem persists, which suggests this hypothesis is wrong.

Any suggestions on what could be tried in order to further triage or resolve this issue would be appreciated.

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

As mentioned, I can reproduce it fairly reliably on a large project; unfortunately, I have yet to find a way of reproducing it on an example project. I'll keep trying though!

Which operating system are you running Bazel on?

Linux (Fedora & CentOS)

What is the output of bazel info release?

7.3.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

n/a

What's the output of git remote get-url origin; git rev-parse HEAD ?

n/a

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

I'm unable to run bazelisk --bisect=6.5.0..7.0.0 because it attempts to revert back to v6.0.0, which is incompatible with most of the rules in MODULE.bazel :(

I'm trying to work out if I can use some of the 7.0.0 pre-release candidates to narrow down when it started failing, but so far no luck.

Have you found anything relevant by searching the web?

I found a similar sounding issue here: https://github.com/bazelbuild/bazel/issues/22151; however, this affects v6.5.0 and that is the version that is working for me.

Similarly, some comments in https://github.com/bazelbuild/bazel/pull/19943 seemed relevant, but I couldn't really turn them into useful avenues of investigation.

Any other information, logs, or outputs that you want to share?

Output of `bazel mod graph`

```
(monorepo@_)
├───aspect_bazel_lib@1.42.1
│   ├───bazel_skylib@1.6.1 (*)
│   ├───platforms@0.0.10 (*)
│   └───stardoc@0.5.4
│       ├───bazel_skylib@1.6.1 (*)
│       ├───rules_java@7.3.1 (*)
│       └───rules_license@1.0.0 (*)
├───bazel_skylib@1.6.1
│   └───platforms@0.0.10 (*)
├───platforms@0.0.10
│   └───rules_license@1.0.0 (*)
├───rules_cc@0.0.9
│   └───platforms@0.0.10 (*)
├───rules_java@7.3.1
│   ├───bazel_skylib@1.6.1 (*)
│   ├───platforms@0.0.10 (*)
│   ├───rules_cc@0.0.9 (*)
│   ├───rules_license@1.0.0 (*)
│   └───rules_proto@6.0.0
│       ├───bazel_features@1.11.0 (*)
│       ├───bazel_skylib@1.6.1 (*)
│       └───rules_license@1.0.0 (*)
├───rules_jvm_external@5.3
│   ├───bazel_skylib@1.6.1 (*)
│   └───stardoc@0.5.4 (*)
├───rules_license@1.0.0
├───rules_oci@1.7.6
│   ├───aspect_bazel_lib@1.42.1 (*)
│   ├───bazel_skylib@1.6.1 (*)
│   ├───platforms@0.0.10 (*)
│   └───container_structure_test@1.16.0
│       ├───aspect_bazel_lib@1.42.1 (*)
│       ├───bazel_skylib@1.6.1 (*)
│       └───platforms@0.0.10 (*)
├───rules_pkg@0.9.1
│   ├───bazel_skylib@1.6.1 (*)
│   ├───rules_license@1.0.0 (*)
│   └───rules_python@0.34.0 (*)
└───rules_python@0.34.0
    ├───bazel_skylib@1.6.1 (*)
    ├───platforms@0.0.10 (*)
    ├───rules_cc@0.0.9 (*)
    ├───rules_proto@6.0.0 (*)
    ├───bazel_features@1.11.0
    │   └───bazel_skylib@1.6.1 (*)
    └───protobuf@24.4
        ├───bazel_skylib@1.6.1 (*)
        ├───platforms@0.0.10 (*)
        ├───rules_cc@0.0.9 (*)
        ├───rules_java@7.3.1 (*)
        ├───rules_jvm_external@5.3 (*)
        ├───rules_pkg@0.9.1 (*)
        ├───rules_proto@6.0.0 (*)
        ├───abseil-cpp@20230802.0.bcr.1
        │   ├───bazel_skylib@1.6.1 (*)
        │   ├───googletest@1.14.0 (*)
        │   ├───platforms@0.0.10 (*)
        │   └───rules_cc@0.0.9 (*)
        ├───googletest@1.14.0
        │   ├───abseil-cpp@20230802.0.bcr.1 (*)
        │   ├───platforms@0.0.10 (*)
        │   └───rules_cc@0.0.9 (*)
        ├───upb@0.0.0-20230516-61a97ef
        │   ├───abseil-cpp@20230802.0.bcr.1 (*)
        │   ├───bazel_skylib@1.6.1 (*)
        │   ├───platforms@0.0.10 (*)
        │   ├───rules_pkg@0.9.1 (*)
        │   └───rules_proto@6.0.0 (*)
        └───zlib@1.3.1.bcr.3
            ├───platforms@0.0.10 (*)
            └───rules_cc@0.0.9 (*)
```

I've tried upgrading various rules, but with no luck (and generally bringing in other difficulties!).

tjgq commented 4 days ago

Some ideas to gather more information:

  1. Can you reproduce this with --spawn_strategy=standalone (i.e., does it also happen with sandboxing disabled)?
  2. Can you inspect the contents of the sandbox left over by --sandbox_debug and verify that the executable is present at the expected location? In particular, if it is a symlink, does the symlink dangle, or does it point to the expected file?
  3. Since this seems to occur for at least two different rules, can you reduce it further? Say, does a minimal genrule like the one below also reproduce the issue?
  4. Can you provide the full list of flags you're using, including the ones set in bazelrc files?

genrule(
  name = "gen",
  outs = ["out.txt"],
  cmd = "touch $@",
)

swarren12 commented 4 days ago

It'll take me a while to go through all of those suggestions, so I'll update this comment as I go, but:

> 1. Can you reproduce this with --spawn_strategy=standalone (i.e., does it also happen with sandboxing disabled?)

Yes, it appears I can:

$ cat bazel-out/k8-fastbuild/testlogs/.../test.log
src/main/tools/process-wrapper-legacy.cc:80: "execvp(external/bazel_tools/tools/test/test-setup.sh, ...)": No such file or directory

> 2. Can you inspect the contents of the sandbox left over by --sandbox_debug and verify that the executable is present at the expected location? In particular, if it is a symlink, does the symlink dangle, or does it point to the expected file?

I think the answer here is "sometimes" (or "I don't always know how to read the sandbox debug output properly")

ERROR: /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/external/rules_jvm_external~/private/tools/java/com/github/bazelbuild/rules_jvm_external/zip/BUILD:1:13: Compiling Java headers external/rules_jvm_external~/private/tools/java/com/github/bazelbuild/rules_jvm_external/zip/libzip-hjar.jar (1 source file) [for tool] failed: (Exit 1): linux-sandbox failed: error executing Turbine command 
  (cd /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/62/execroot/_main && \
...
src/main/tools/linux-sandbox-pid1.cc:530: "execvp(external/_main~java_repositories~jdk11/bin/java, 0xf5ab30)": No such file or directory
ERROR: /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/external/rules_jvm_external~/private/tools/java/com/github/bazelbuild/rules_jvm_external/jar/BUILD:3:12 Building external/rules_jvm_external~/private/tools/java/com/github/bazelbuild/rules_jvm_external/jar/AddJarManifestEntry.jar (1 source file) [for tool] failed: (Exit 1): linux-sandbox failed: error executing Turbine command 

and then:

$ ls /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/62/execroot/_main/external/rules_jvm_external~/private/tools/java/com/github/bazelbuild/rules_jvm_external/jar/AddJarManifestEntry.jar
ls: cannot access '/var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/62/execroot/_main/external/rules_jvm_external~/private/tools/java/com/github/bazelbuild/rules_jvm_external/jar/AddJarManifestEntry.jar': No such file or directory

In fact, there seemed to be quite a few missing files and at least one broken symlink under sandbox/linux-sandbox/62/execroot/_main/external/.

But on another run:

ERROR: lib/BUILD:1498:13: Extracting interface for jar lib/3rd-party/org.hamcrest/hamcrest/hamcrest-core-1.3.jar failed: (Exit 1): linux-sandbox failed: error executing JavaIjar command 
  (cd /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/105/execroot/_main && \
  exec env - \
    PATH=/bin:/usr/bin:/usr/local/bin \
    TMPDIR=/tmp \
  /var/lib/jenkins/.cache/bazel/_bazel_jenkins/install/5d4256ba95eeafc7a3485f16e4778c0d/linux-sandbox -t 15 -w /dev/shm -w /tmp -w /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/105/execroot/_main -M /tmp -S /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/105/stats.out -N -D /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/105/debug.out -- external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar lib/3rd-party/org.hamcrest/hamcrest/hamcrest-core-1.3.jar bazel-out/k8-fastbuild/bin/lib/_ijar/hamcrest-core/lib/3rd-party/org.hamcrest/hamcrest/hamcrest-core-1.3-ijar.jar --target_label //lib:hamcrest-core)
src/main/tools/linux-sandbox-pid1.cc:530: "execvp(external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar, 0x122e710)": No such file or directory

gives:

$ ls --color -lA /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/105/execroot/_main/external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar
lrwxrwxrwx. 1 jenkins jenkins 169 Sep 24 21:23 /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/sandbox/linux-sandbox/105/execroot/_main/external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar -> /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/execroot/_main/external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar

$ ls -lA /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/execroot/_main/external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar
-r-xr-xr-x. 1 jenkins jenkins 228368 Nov 30  2023 /var/lib/jenkins/.cache/bazel/_bazel_jenkins/38b07d741dde33298ed2fff99f485394/execroot/_main/external/rules_java~~toolchains~remote_java_tools_linux/java_tools/ijar/ijar

which does indeed seem to exist?
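
The one-off `ls` checks above could be extended to a whole retained sandbox tree. The following is a minimal sketch, assuming GNU find (for `-xtype`) and a sandbox directory kept by --sandbox_debug; the function name `list_dangling_symlinks` is hypothetical and not part of Bazel.

```shell
# Hypothetical helper: print every symlink under the given directory whose
# target does not resolve. Assumes GNU find, which provides -xtype.
# Example call (path shortened, adjust to the sandbox being inspected):
#   list_dangling_symlinks .../sandbox/linux-sandbox/62/execroot/_main/external
list_dangling_symlinks() {
  find "$1" -xtype l
}
```

An empty result only shows that every link resolves at the moment of the check. As the ijar example above suggests, a target may be missing while the action runs and present again by the time the sandbox is inspected.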

> 3. Since this seems to occur for at least two different rules, can you reduce it further? Say, does a minimal genrule like the one below also reproduce the issue?

I'm not able to reproduce it with a simple genrule. I've tried adding the sample, and we already have quite a few basic rules in the code base, but I have yet to observe any of them fail. I'll keep an eye out, but so far it appears these rules are not affected.

However, tangentially related, I have seen some of our custom Bazel rules using ctx.actions.run() and ctx.actions.run_shell() fail.

> 4. Can you provide the full list of flags you're using, including the ones set in bazelrc files?

INFO: Reading 'startup' options from /var/lib/jenkins/workspace/.bazelrc: --host_jvm_args=-Djavax.net.ssl.trustStore=internal.truststore
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=184
INFO: Reading rc options for 'test' from /var/lib/jenkins/workspace/.bazelrc:
  Inherited 'common' options: --enable_bzlmod  --experimental_downloader_config=build/bazel/downloader.cfg --experimental_allow_tags_propagation --incompatible_no_implicit_file_export --java_language_version=11 --tool_java_language_version=11 --java_runtime_version=custom_11 --tool_java_runtime_version=custom_11 --announce_rc --attempt_to_print_relative_paths
INFO: Reading rc options for 'test' from /var/lib/jenkins/workspace/.bazelrc:
  Inherited 'build' options: --incompatible_strict_action_env --incompatible_enable_cc_toolchain_resolution --use_ijars --experimental_strict_java_deps=strict --explicit_java_test_deps --strategy=MakeRpm=local --nosandbox_default_allow_network --verbose_failures --show_result=50 --strategy_regexp=benchmark=standalone
INFO: Reading rc options for 'test' from /var/lib/jenkins/workspace/cache.bazelrc:
  Inherited 'build' options: --remote_cache=http://internal.cache:8081 --remote_timeout=120 --noremote_upload_local_results
INFO: Reading rc options for 'test' from /var/lib/jenkins/workspace/.bazelrc:
  'test' options: --test_output=errors --test_summary=terse
INFO: Reading rc options for 'test' from /var/lib/jenkins/workspace/local.bazelrc:
  'test' options: --test_output=summary --test_summary=short

I've tried removing the Java overrides (we need them in CI, but I can build without them on the development machine) but it had no effect. Similarly, enabling/disabling the remote caching also seems to change nothing. Finally I also tried removing --worker_quit_after_build and --worker_sandboxing locally, but the issue still persisted.

swarren12 commented 4 days ago

I couldn't get bazelisk --bisect to play nice, but I think I've narrowed it down to something that happened between 7.0.0-pre.20231011.2 and 7.0.0-pre.20231018.3. The former seems to build fine, the latter does not.
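
For anyone repeating this kind of manual narrowing, a rough sketch follows. It assumes Bazelisk is installed and relies on its `USE_BAZEL_VERSION` environment variable to select the release; the function name `try_versions` and the `BAZELISK` override variable are hypothetical, introduced only so the launcher can be swapped out.

```shell
# Hypothetical helper: run the same Bazel invocation once per version and
# print PASS or FAIL for each. BAZELISK is an assumed override for the
# launcher and defaults to "bazelisk".
# Example call, using the two pre-releases named above:
#   try_versions "7.0.0-pre.20231011.2 7.0.0-pre.20231018.3" \
#     test //... --test_tag_filters=smoke
try_versions() {
  versions=$1
  shift
  for v in $versions; do
    if USE_BAZEL_VERSION="$v" "${BAZELISK:-bazelisk}" "$@" >/dev/null 2>&1; then
      echo "$v PASS"
    else
      echo "$v FAIL"
    fi
  done
}
```

Because the failure is intermittent, a single PASS is weak evidence; each version probably needs several runs before it can be treated as good.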

Tentative culprit: https://github.com/bazelbuild/bazel/commit/1b729a5bb44c244556437ed6a330d25f7e19e3c4

I've managed to do a clean build using --noexperimental_merged_skyframe_analysis_execution. I'm going to run it a few more times before saying definitively that that is the cause, but it's looking promising!
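
Since the failure is intermittent, one clean build does not settle it. A minimal sketch of the "run it a few more times" step is below; the function name `count_failures` is hypothetical, and the example command only reuses flags already mentioned in this issue.

```shell
# Hypothetical helper: run a command N times and report how many runs failed.
# Usage: count_failures <runs> <command> [args...]
# Example call:
#   count_failures 10 bazel test //... --test_tag_filters=smoke \
#     --noexperimental_merged_skyframe_analysis_execution
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Output is discarded; only the exit status of each run is counted.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed"
}
```

Running the same count with and without the flag would give a like-for-like comparison of failure rates.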

fmeum commented 4 days ago

@joeleba

joeleba commented 4 days ago

This sounds a bit like #22073. Could you try this out with a bazel version >= 7.2.0?

swarren12 commented 4 days ago

> This sounds a bit like #22073. Could you try this out with a bazel version >= 7.2.0?

I've tried on v7.0.0, v7.1.2, v7.2.1 and v7.3.1 and all of them seem to exhibit the same behaviour.