bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

SIGBUS JVM error seen with bazel version 6.3.0 #23146

Open ryanmacdonald opened 1 month ago

ryanmacdonald commented 1 month ago

Description of the bug:

I'm seeing the following SIGBUS error signature while running a large Scala build with Bazel v6.3.0:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f90f3c68a39, pid=2, tid=3
#
# JRE version:  (11.0.15+10) (build )
# Java VM: OpenJDK 64-Bit Server VM (11.0.15+10-LTS, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xc21a39]  PerfMemory::alloc(unsigned long)+0x59
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /usr/local/blah/hs_err_pid2.log
#
#

I see that a few previously filed GitHub issues (e.g., here and here) resolved this error by adding the --sandbox_tmpfs_path=/tmp flag. However, when I do this I see:

ERROR:
1722306775.109326470: src/main/tools/linux-sandbox.cc:152: calling pipe(2)...
1722306775.109397060: src/main/tools/linux-sandbox.cc:171: calling clone(2)...
1722306775.118793770: src/main/tools/linux-sandbox.cc:180: linux-sandbox-pid1 has PID 498031
1722306775.118867250: src/main/tools/linux-sandbox-pid1.cc:681: Pid1Main started
1722306775.119031019: src/main/tools/linux-sandbox.cc:197: done manipulating pipes
1722306775.156080366: src/main/tools/linux-sandbox-pid1.cc:275: tmpfs: /tmp
1722306775.163422045: src/main/tools/linux-sandbox-pid1.cc:285: working dir: /usr/local/home/ryanmacdonald/.cache/bazel/_bazel_ryanmacdonald/537546fbafb6167a7c1dbd6d108126ed/sandbox/linux-sandbox/20/execroot/
1722306775.282923973: src/main/tools/linux-sandbox-pid1.cc:320: writable: /usr/local/home/ryanmacdonald/.cache/bazel/_bazel_ryanmacdonald/537546fbafb6167a7c1dbd6d108126ed/sandbox/linux-sandbox/20/execroot/darwinn_tpu
1722306775.282977563: src/main/tools/linux-sandbox-pid1.cc:320: writable: /tmp/cloud/batch/004684105
src/main/tools/linux-sandbox-pid1.cc:329: "mount(/tmp/cloud/batch/004684105, /tmp/cloud/batch/004684105, nullptr, MS_BIND | MS_REC, nullptr)": No such file

The error occurs sporadically, in about 30-50% of the trials I've run.

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

N/A

Which operating system are you running Bazel on?

Red Hat EL 8.10

What is the output of bazel info release?

release 6.3.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

error: No such remote 'origin'
2c29c0091687076a5145aa71bce95422f0de70f3

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

ryanmacdonald commented 1 month ago

Worth noting that I see a similar error when I attempt to use Bazel v7.2.1 without the --sandbox_tmpfs_path=/tmp flag:

ERROR: /workspace/us/cbf/user/ryanmacdonald/BUILD:9:19: Executing genrule //pkg-preprocess failed: (Exit 1): linux-sandbox failed: error executing Genrule command 
  (cd /usr/local/home/ryanmacdonald/.cache/bazel/_bazel_ryanmacdonald/997005025ac9f0daba86b61b2c3d2ad0/sandbox/linux-sandbox/40/execroot/_main && \
src/main/tools/linux-sandbox-pid1.cc:320: "mount(/tmp/cloud/batch/004741035, /tmp/cloud/batch/004741035, nullptr, MS_BIND | MS_REC, nullptr)": No such file or directory

fmeum commented 1 month ago

Could you share the Bazel command you are running with all its flags as well as the directory in which it runs? What's in /tmp/cloud?

ryanmacdonald commented 1 month ago

Full bazel command line:

bazel build --define proj=foo --compilation_mode=opt --remote_cache=<remote cache url> --extra_toolchains=@local_jdk//:all --sandbox_tmpfs_path=/tmp //path/to/target

/tmp/cloud/ has a batch subdirectory and then a bunch of subdirs under that with 9 digit names:

/tmp/cloud/batch> ls
001301595  003002190  004112935  005060954  005683908  013316919  014160521  014516820  014571466  014625115
001693463  003240165  004841924  005061070  007954108  013504958  014170704  014527071  014574927  014632679
001695585  003240703  004932092  005543441  008888472  013630435  014317761  014541580  014582087  014641326
002331398  003424045  004961724  005543559  009281013  014079770  014324237  014546661  014585025  014647758
002353218  003545566  005044664  005623208  012103415  014096863  014375605  014553554  014600433
003002110  003860811  005045323  005683519  012720324  014102594  014401783  014571380  014609712

All these dirs are empty

I tried adding --spawn_strategy=processwrapper-sandbox to these options. That seems to remove the linux-sandbox-pid1.cc error, but the build now fails with a Java stack overflow about 15% of the time.

fmeum commented 1 month ago

Do you know where these /tmp/cloud folders come from? Bazel may mount them if you were to run the build under /tmp, but it looks like you aren't. If these directories are updated concurrently, that could explain the failure.
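
One rough way to check for concurrent updates is to compare two listings of the directory taken a short time apart. The snippet below is only a sketch: `listing_changed` is a hypothetical helper name, not part of Bazel, and it only notices entries that were added or removed between the two snapshots.

```shell
# Hypothetical helper (not a Bazel tool): succeeds (exit 0) if the set of
# entries in directory $1 differs between two listings taken $2 seconds apart.
listing_changed() {
  dir=$1
  delay=${2:-1}
  before=$(ls -1A "$dir" 2>/dev/null | sort)
  sleep "$delay"
  after=$(ls -1A "$dir" 2>/dev/null | sort)
  [ "$before" != "$after" ]
}

# Example, using the directory from this thread:
#   if listing_changed /tmp/cloud/batch 5; then
#     echo "entries were added or removed while waiting"
#   fi
```

If this reports changes while a build is running, a directory could disappear between the moment the sandbox lists it and the moment it tries to bind-mount it.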

ryanmacdonald commented 1 month ago

Ah, I forgot to mention that I'm also setting --output_base=/tmp/bazel_build_<unique_id> for the above errors.

These /tmp/cloud dirs are apparently created by our runner. Each /tmp/cloud/batch/<#> dir is assigned to a single job, so within that job $TMPDIR evaluates to its own unique /tmp/cloud/batch/<#> dir.

Should I be doing something like --output_base="$TMPDIR" and --sandbox_tmpfs_path="$TMPDIR"?

ryanmacdonald commented 1 month ago

Hey @fmeum, any other thoughts on this?

fmeum commented 1 month ago

You may be running into https://github.com/bazelbuild/bazel/issues/23217, albeit with a different error message. Does your build succeed without TMPDIR set?
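
A minimal sketch of one way to test this, assuming `env -u` is available (it is in GNU coreutils). The wrapper name `run_without_tmpdir` is made up for illustration; it removes TMPDIR only for the wrapped command and leaves the calling shell untouched.

```shell
# Hypothetical wrapper: run the given command with TMPDIR removed from its
# environment. The calling shell keeps its own TMPDIR.
run_without_tmpdir() {
  env -u TMPDIR "$@"
}

# Example, reusing the flags quoted earlier in this thread (placeholders
# such as <unique_id> are left as they were posted):
#   run_without_tmpdir bazel --output_base=/tmp/bazel_build_<unique_id> \
#     build --define proj=foo --compilation_mode=opt //path/to/target
```

Note that a Bazel server that is already running may keep the environment it was started with, so it is probably worth running `bazel shutdown` before trying this.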

meteorcloudy commented 1 month ago

@oquenchil Can you take a look?