Bazel server crashes at the very beginning (stack trace starts at CollectLocalResourceUsage).
The bug happens in CI:
build runs inside a docker image: icr.io/continuous-delivery/pipeline/pipeline-base-ubi:3.40
build runs as root
java version is openjdk version "17.0.10" 2024-01-16 IBM Semeru Runtime Open Edition 17.0.10.0 (build 17.0.10+7)
Locally (macOS) everything works perfectly, so it looks like the crash is related to the container runtime.
++ bazel --output_base=/workspace/app/bazel_output build --noexperimental_collect_resource_estimation --config=ci //...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/workspace/app/bazel_output/server/jvm.out')
++ true
++ bazel --output_base=/workspace/app/bazel_output test --noexperimental_collect_resource_estimation --config=ci //...
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=941) to terminate.
FATAL: Attempted to kill stale server process (pid=941) using SIGKILL, but it did not die in a timely fashion.
++ true
++ cat /workspace/app/bazel_output/server/jvm.out
OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 and will likely be removed in a future release.
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.NullPointerException
at java.base/java.util.Objects.requireNonNull(Unknown Source)
at java.base/sun.nio.fs.UnixFileSystem.getPath(Unknown Source)
at java.base/java.nio.file.Path.of(Unknown Source)
at java.base/java.nio.file.Paths.get(Unknown Source)
at java.base/jdk.internal.platform.CgroupUtil.lambda$readStringValue$1(Unknown Source)
at java.base/java.security.AccessController.doPrivileged(Unknown Source)
at java.base/jdk.internal.platform.CgroupUtil.readStringValue(Unknown Source)
at java.base/jdk.internal.platform.CgroupSubsystemController.getStringValue(Unknown Source)
at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getCpuSetCpus(Unknown Source)
at java.base/jdk.internal.platform.CgroupMetrics.getCpuSetCpus(Unknown Source)
at jdk.management/com.sun.management.internal.OperatingSystemImpl.isCpuSetSameAsHostCpuSet(Unknown Source)
at jdk.management/com.sun.management.internal.OperatingSystemImpl$ContainerCpuTicks.getContainerCpuLoad(Unknown Source)
at jdk.management/com.sun.management.internal.OperatingSystemImpl.getCpuLoad(Unknown Source)
at jdk.management/com.sun.management.OperatingSystemMXBean.getSystemCpuLoad(Unknown Source)
at com.google.devtools.build.lib.profiler.CollectLocalResourceUsage.run(CollectLocalResourceUsage.java:144)
Note.-
I'm using --noexperimental_collect_resource_estimation, but the flag seems to have no effect (I don't know whether it is supposed to skip the CollectLocalResourceUsage step).
In the second call to Bazel, the server from the first call is still running (stale); the client tries to kill it but gives up.
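Since the trace above goes through CgroupV1Subsystem.getCpuSetCpus, listing the cgroup-related mounts visible inside the container might help compare the failing CI environment with one that works. The following is a minimal sketch under assumptions: the class name CgroupMountProbe is made up for illustration, and the guess that the cpuset controller mount is what differs between environments is unconfirmed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Bazel or the JDK.
public class CgroupMountProbe {

    // Returns the lines of a mountinfo-style file that mention "cgroup".
    // Returns an empty list if the file is missing or unreadable.
    static List<String> cgroupLines(Path mountInfo) {
        if (!Files.isReadable(mountInfo)) {
            return List.of();
        }
        try {
            return Files.readAllLines(mountInfo).stream()
                    .filter(line -> line.contains("cgroup"))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> lines = cgroupLines(Path.of("/proc/self/mountinfo"));
        if (lines.isEmpty()) {
            System.out.println("no cgroup mounts found");
        }
        // Assumption: a missing "cpuset" entry here may be relevant to the crash.
        lines.forEach(System.out::println);
    }
}
```

Running it in both the CI image and a working Linux container and diffing the output would show whether the cgroup layout differs.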
Which category does this issue belong to?
Core
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
To reproduce the bug, a simple bazel build //... is enough. However, it only fails in some environments under some circumstances (probably depending on the container runtime), so I think it's not easy to reproduce.
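Because the stack trace reaches the NullPointerException through OperatingSystemMXBean.getSystemCpuLoad, a small standalone probe might help check whether the same call fails in the CI container without Bazel involved. This is only a sketch: the class name CpuLoadProbe is made up for illustration, and the result is only meaningful if it is run with the same JVM that the Bazel server uses, which I have not verified.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

// Hypothetical probe, not part of Bazel.
public class CpuLoadProbe {

    // Calls the same MXBean method that appears in the stack trace and
    // reports either the returned value or the exception that was thrown.
    @SuppressWarnings("deprecation") // getSystemCpuLoad is deprecated since JDK 14
    static String probe() {
        OperatingSystemMXBean os =
                ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);
        try {
            double load = os.getSystemCpuLoad();
            return "ok: " + load;
        } catch (RuntimeException e) {
            // A NullPointerException here would point at the JVM rather than Bazel.
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```

If the probe prints a "failed:" line in the CI container, the problem can be reproduced without a Bazel workspace at all.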
Which operating system are you running Bazel on?
No response
What is the output of bazel info release?
It also fails
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD ?
No response
If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.
No response
Have you found anything relevant by searching the web?
Any other information, logs, or outputs that you want to share?
Is there any CLI flag I can use to "bypass" this resource collection (and avoid the issue)? Thanks!