bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Bazel server crashes from the profiler (`CollectLocalResourceUsage`) #22955

Open jgsogo opened 5 days ago

jgsogo commented 5 days ago

Description of the bug:

Bazel server crashes at the very beginning (stack trace starts at CollectLocalResourceUsage).

The bug happens in CI:

Locally (macOS) everything works perfectly. The crash looks related to the container runtime (related?).

++ bazel --output_base=/workspace/app/bazel_output build --noexperimental_collect_resource_estimation --config=ci //...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...

Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/workspace/app/bazel_output/server/jvm.out')

++ true
++ bazel --output_base=/workspace/app/bazel_output test --noexperimental_collect_resource_estimation --config=ci //...
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=941) to terminate.
FATAL: Attempted to kill stale server process (pid=941) using SIGKILL, but it did not die in a timely fashion.
++ true
++ cat /workspace/app/bazel_output/server/jvm.out
OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 and will likely be removed in a future release.
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.NullPointerException
    at java.base/java.util.Objects.requireNonNull(Unknown Source)
    at java.base/sun.nio.fs.UnixFileSystem.getPath(Unknown Source)
    at java.base/java.nio.file.Path.of(Unknown Source)
    at java.base/java.nio.file.Paths.get(Unknown Source)
    at java.base/jdk.internal.platform.CgroupUtil.lambda$readStringValue$1(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Unknown Source)
    at java.base/jdk.internal.platform.CgroupUtil.readStringValue(Unknown Source)
    at java.base/jdk.internal.platform.CgroupSubsystemController.getStringValue(Unknown Source)
    at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getCpuSetCpus(Unknown Source)
    at java.base/jdk.internal.platform.CgroupMetrics.getCpuSetCpus(Unknown Source)
    at jdk.management/com.sun.management.internal.OperatingSystemImpl.isCpuSetSameAsHostCpuSet(Unknown Source)
    at jdk.management/com.sun.management.internal.OperatingSystemImpl$ContainerCpuTicks.getContainerCpuLoad(Unknown Source)
    at jdk.management/com.sun.management.internal.OperatingSystemImpl.getCpuLoad(Unknown Source)
    at jdk.management/com.sun.management.OperatingSystemMXBean.getSystemCpuLoad(Unknown Source)
    at com.google.devtools.build.lib.profiler.CollectLocalResourceUsage.run(CollectLocalResourceUsage.java:144)

Note.-

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

A simple bazel build //... is enough to reproduce the bug. However, it only fails in some environments under some circumstances (it probably depends on the container runtime), so I don't think it's easy to reproduce.
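One way to check whether the failure is independent of Bazel might be a small Java program that makes the same JDK call seen at the bottom of the stack trace (`OperatingSystemMXBean.getSystemCpuLoad`). This is only a sketch: the class name `CpuLoadProbe` and the `probe` helper are hypothetical, and it assumes the program is run inside the failing container with the same JDK that Bazel's server uses, which may not be the case.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

// Hypothetical standalone probe, not part of Bazel.
public class CpuLoadProbe {

    // Calls the JDK method that appears in the stack trace and reports
    // whether it returned a value or threw (e.g. a NullPointerException).
    static String probe() {
        OperatingSystemMXBean os =
                ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);
        try {
            // Deprecated in newer JDKs, but it is the call shown in the trace.
            double load = os.getSystemCpuLoad();
            return "ok: " + load;
        } catch (RuntimeException e) {
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```

If this prints a line starting with `failed:` in the CI container but `ok:` locally, that would suggest the problem lies in how the JDK reads the container's cgroup data rather than in the Bazel build itself.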

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

It also fails

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

Is there any CLI flag I can use to "bypass" this resource collection (and avoid the issue)? Thanks!

planetf1 commented 4 days ago

Just to add, our understanding of the execution environment is: