bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Bazel server crashes from the profiler (`CollectLocalResourceUsage`) #22955

Open jgsogo opened 4 months ago

jgsogo commented 4 months ago

Description of the bug:

Bazel server crashes at the very beginning (stack trace starts at CollectLocalResourceUsage).

The bug happens in a CI:

Locally (macOS) everything works perfectly. It looks like the crash is related to the container runtime (related?).

++ bazel --output_base=/workspace/app/bazel_output build --noexperimental_collect_resource_estimation --config=ci //...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...

Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/workspace/app/bazel_output/server/jvm.out')

++ true
++ bazel --output_base=/workspace/app/bazel_output test --noexperimental_collect_resource_estimation --config=ci //...
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=941) to terminate.
FATAL: Attempted to kill stale server process (pid=941) using SIGKILL, but it did not die in a timely fashion.
++ true
++ cat /workspace/app/bazel_output/server/jvm.out
OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 and will likely be removed in a future release.
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.NullPointerException
    at java.base/java.util.Objects.requireNonNull(Unknown Source)
    at java.base/sun.nio.fs.UnixFileSystem.getPath(Unknown Source)
    at java.base/java.nio.file.Path.of(Unknown Source)
    at java.base/java.nio.file.Paths.get(Unknown Source)
    at java.base/jdk.internal.platform.CgroupUtil.lambda$readStringValue$1(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Unknown Source)
    at java.base/jdk.internal.platform.CgroupUtil.readStringValue(Unknown Source)
    at java.base/jdk.internal.platform.CgroupSubsystemController.getStringValue(Unknown Source)
    at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getCpuSetCpus(Unknown Source)
    at java.base/jdk.internal.platform.CgroupMetrics.getCpuSetCpus(Unknown Source)
    at jdk.management/com.sun.management.internal.OperatingSystemImpl.isCpuSetSameAsHostCpuSet(Unknown Source)
    at jdk.management/com.sun.management.internal.OperatingSystemImpl$ContainerCpuTicks.getContainerCpuLoad(Unknown Source)
    at jdk.management/com.sun.management.internal.OperatingSystemImpl.getCpuLoad(Unknown Source)
    at jdk.management/com.sun.management.OperatingSystemMXBean.getSystemCpuLoad(Unknown Source)
    at com.google.devtools.build.lib.profiler.CollectLocalResourceUsage.run(CollectLocalResourceUsage.java:144)

Note.-

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

To reproduce the bug, a simple bazel build //... is enough. However, it only fails in some environments under some circumstances (it probably depends on the container runtime), so I think it's not easy to reproduce.

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

It also fails

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

Is there any CLI flag I can use to "bypass" this resource collection (and avoid the issue)? Thanks!

planetf1 commented 4 months ago

Just to add, our understanding of the execution environment is:

planetf1 commented 2 months ago

I finally found that adding the following line to .bazelrc worked:

startup --host_jvm_args=-XX:-UseContainerSupport

Obviously, it may be necessary to explicitly set memory and CPU limits to get the best performance, but at least there's no exception now and Bazel runs.

There's probably still a JVM issue, so please let me know if there are other things you'd like me to try in order to capture more info. Additionally, I think the system config of this machine is a bit unique, which is something we'll look at in our environment.

fmeum commented 2 months ago

Looks like https://bugs.openjdk.org/browse/JDK-8286212, which is still open.
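For anyone who wants to check whether a given container hits this JDK code path without involving Bazel, here is a minimal sketch of a standalone probe. It calls the same public method that appears in the stack trace (`OperatingSystemMXBean.getSystemCpuLoad`). The class name `CpuLoadProbe` and the helper `readSystemCpuLoad` are hypothetical names for illustration only. This is not Bazel's code and not a proposed fix; it only assumes that the exception would surface from this call as it does in the trace above.

```java
import java.lang.management.ManagementFactory;
import java.util.OptionalDouble;

// Hypothetical standalone probe; not part of Bazel or the JDK.
final class CpuLoadProbe {

    /**
     * Returns the system CPU load in [0.0, 1.0], or empty if the JVM cannot
     * report it (negative or NaN value) or if the call throws.
     */
    static OptionalDouble readSystemCpuLoad() {
        try {
            java.lang.management.OperatingSystemMXBean bean =
                    ManagementFactory.getOperatingSystemMXBean();
            if (bean instanceof com.sun.management.OperatingSystemMXBean) {
                // Same public entry point as in the stack trace above.
                double load = ((com.sun.management.OperatingSystemMXBean) bean)
                        .getSystemCpuLoad();
                // A negative value means "not available"; NaN also fails this check.
                return load >= 0.0 ? OptionalDouble.of(load) : OptionalDouble.empty();
            }
            return OptionalDouble.empty();
        } catch (RuntimeException e) {
            // Assumption: in an affected container, the NullPointerException
            // from the JDK's cgroup handling would be caught here.
            System.err.println("getSystemCpuLoad failed: " + e);
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("system CPU load: " + readSystemCpuLoad());
    }
}
```

Running it inside the failing container with and without `-XX:-UseContainerSupport` might help confirm that the workaround above avoids the cgroup code path, since the trace passes through `ContainerCpuTicks`.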