Open uajith opened 3 years ago
Do you have an easy way to reproduce this?
No other way I could see.
I'm hitting the same problem. @uajith @meisterT
@clemente0420 in a different environment?
A CentOS Docker container on an Ubuntu 16.04 x86 host. It happens after running bazelisk clean.
@uajith @meisterT found the problem: check your host for zombie processes.
Ok, it works.. Looks like the zombie process is causing the problem.
Can you confirm whether you're still seeing real (non-zombie) Bazel processes that you can't get rid of?
If these are zombie processes, I think it would help to run an init process inside your container (with Docker we like to use docker run --init for this purpose).
I just ran into the same issue, also inside a container env that we use for all kinds of build processes, including bitbake which has a client-server structure as well. Simply issuing bazel and then bazel shutdown exposes the issue; adding the otherwise unneeded --init to the container env works around it.
What makes only bazel stumble here? Can't this be resolved differently?
What does the server log say during the shutdown (use bazel info | grep server_log before shutdown to find the log)?
This is with bazel-bootstrap from Debian bullseye: java.log.90be0d0794f5.builder.log.java.20220110-172052.2484.txt
I don't find messages being added when shutdown is invoked, though.
It does indicate that it finishes the shutdown command within a fraction of a second. So it's not that something within Bazel itself is waiting forever during the shutdown command; it's something about the state it ends up in.
I am having the same issue. Bazel doesn't exit after build
If you experience this please run jstack <pid> where <pid> is the PID of the Bazel server.
I am having the same issue in a container environment. I am trying to narrow down the repro steps. The steps I am using are as follows:
- run bazel inside the container, with customized output_base
- run bazel shutdown
- rerun step 1
- now I see logs like (ignore the stderr | prefix):
stderr | WARNING: Running Bazel server needs to be killed, because the startup options are different.
stderr | WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
stderr | WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
stderr | WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
stderr | INFO: Waited 60 seconds for server process (pid=3292) to terminate.
stderr | WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
stderr | WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
stderr | INFO: Waited 10 seconds for server process (pid=3292) to terminate.
stderr | FATAL: Attempted to kill stale server process (pid=3292) using SIGKILL, but it did not die in a timely fashion.
The process 3292 is in defunct state and I cannot use jstack to dump the stacktrace:
root 3292 143 0.0 0 0 ? Zs 16:22 0:44 [java] <defunct>
Any hints for this behavior?
I am using 5.1.1
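For reference, the Z in the STAT column of the ps line above marks a zombie (defunct) process: it has already exited, so jstack has nothing to attach to. A minimal Python sketch of the same check, assuming a Linux /proc filesystem; process_state and is_zombie are hypothetical helper names and are not part of Bazel:

```python
import os


def process_state(pid):
    """Return the one-letter Linux process state for pid, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        # No such PID, or no /proc filesystem on this platform.
        return None
    # Layout is "pid (comm) state ...". comm may contain spaces or
    # parentheses, so split after the last ")" before reading the state.
    return stat.rsplit(")", 1)[1].split()[0]


def is_zombie(pid):
    """True if pid has exited but has not been reaped by its parent."""
    return process_state(pid) == "Z"
```

If is_zombie(pid) is true for the Bazel server PID, sending more signals will not help; only the parent of that process (or an init that adopts it) can make it go away by waiting on it.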
Looks like my case (or any case where you are running bazel inside a container) is related to this init process issue: https://github.com/kubernetes/kubernetes/issues/84210
Just hit the same problem in our CI/CD pipeline. The problem was, yes, the lack of an init process / child reaper.
What happens:
- bazel shutdown or any bazel command that requires killing/restarting the bazel daemon will use kill($serverPid) to terminate the server.
- If PID 1 is not a process that will reap children (eg, waitpid for any child that dies), the bazel daemon with $serverPid will remain as a zombie once killed. From the OS point of view, the process with $serverPid will keep existing, both as a PID and as an entry at /proc/$serverPid, until a parent waitpids on it.
- In src/main/cpp/blaze_util_posix.cc, the bazel command trying to kill the bazel server keeps sending kill -TERM $serverPid or kill -9 $serverPid until ... the pid goes away from /proc/$serverPid or until kill($serverPid, 0) returns error (depending on platform).
Solution/fix: in your container, use an entrypoint that does child reaping. Eg, have PID 1 be /bin/docker-init, /sbin/init, or custom code. Alternatively, run something in the container that does child reaping via PR_SET_CHILD_SUBREAPER, like /bin/docker-init -s.
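The zombie behaviour described above can be reproduced outside Bazel. The following Python sketch is an illustration only: it is not Bazel's code, and pid_exists and zombie_demo are made-up names. It kills a child with SIGKILL, shows that kill(pid, 0) still succeeds while nobody has reaped the child, and that the PID only disappears after waitpid.

```python
import os
import signal
import time


def pid_exists(pid):
    """Return True if the PID is still allocated (signal 0 can target it)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def zombie_demo():
    pid = os.fork()
    if pid == 0:
        # Child: idle until the parent kills it.
        time.sleep(60)
        os._exit(0)

    os.kill(pid, signal.SIGKILL)
    time.sleep(0.2)                # let the child die; it is now a zombie
    before_reap = pid_exists(pid)  # still True: nobody has waited on it
    os.waitpid(pid, 0)             # reap the child, as an init would
    after_reap = pid_exists(pid)   # now False: the PID has been released
    return before_reap, after_reap
```

A client that polls for the PID to go away, as the bazel command is described as doing, would keep seeing the first result for as long as the parent never calls waitpid.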
Is it possible to run a "sleep" and kill any pending bazel processes ?
So how did we solve this problem? I'm using a CI/CD system built on Jenkins with the Kubernetes plugin.
The process of my jenkins agent looks like this
I ended up getting the following error.
+ make build-release
bin/bazel clean --expunge
(07:04:55) INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
(07:04:55) INFO: Clean command is running, shutting down worker pool...
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=44) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=44) to terminate.
FATAL: Attempted to kill stale server process (pid=44) using SIGKILL, but it did not die in a timely fashion.
make: *** [build-release] Error 36
We've got the same issue while using Bazel in GitHub Actions to build/test a large C++ code base: we sometimes see the "Stop containers" step taking minutes or even >1h / running into a timeout after both bazel build and bazel test have already completed successfully. Usually "Stop containers" takes a few seconds, and this only happens maybe every 100 or 200 runs.
After finding this issue we've added '--init' to the 'docker run' invocation. That seems to fix the problem. However it's unclear what and why this was happening in the first place. Our legacy CMake build never caused the "Stop containers" step to hang.
See the explanation in https://github.com/bazelbuild/bazel/issues/13823#issuecomment-1247177037 above. TL;DR: bazel shutdown waits for the "pid to disappear", but if there is no "child reaper" (eg, init, or something doing waitpid on the dead daemon), on any unix system the pid will keep existing and sticking around (so the status code, error state, etc is not lost). From the documentation, it looks like --init in docker starts an init, which does child reaping for zombies.
We are having the same issue running inside a Jenkins Docker container with docker exec on Amazon Linux 2023. I don't think we can change the Jenkins agent configuration or configure Docker inside the Jenkins pipeline (Jenkinsfile) to add --init. Unfortunately the Python package tink uses bazel, and with this issue open for the past 3 years it is really hard to build wheels ourselves. It should be possible to run bazel without server/daemon mode, or it should at least be compatible with Docker in 2024 out of the box.
running build_ext
bazel clean --expunge
Starting local Bazel server and connecting to it...
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=292) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=292) to terminate.
FATAL: Attempted to kill stale server process (pid=292) using SIGKILL, but it did not die in a timely fashion.
error: command '/usr/bin/bazel' failed with exit code 36
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tink
Running setup.py clean for tink
Failed to build tink
ERROR: Failed to build one or more wheels
[Pipeline] }
I am not sure whether this should be Bazel's business.
Next to docker --init there are other ways to spawn a lightweight init in the container, e.g. https://github.com/phusion/baseimage-docker/blob/rel-0.9.16/image/bin/my_init or https://github.com/Yelp/dumb-init.
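To illustrate what such a lightweight init does, here is a minimal Python sketch of a reaping entrypoint. It is a toy under stated assumptions, not a replacement for docker --init or dumb-init: run_as_init is a hypothetical name, and unlike the real tools it does not forward signals to the child.

```python
import os
import sys


def run_as_init(argv):
    """Start argv as a child, reap every child that exits, return its exit code."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)  # only returns (by raising) on failure
        finally:
            os._exit(127)             # exec failed: never fall back into the parent code

    exit_code = None
    while exit_code is None:
        # Blocks until any child exits. When this process is PID 1, orphaned
        # processes are reparented to it and get reaped here as well.
        pid, status = os.wait()
        if pid == main_pid:
            if os.WIFEXITED(status):
                exit_code = os.WEXITSTATUS(status)
            else:
                exit_code = 128 + os.WTERMSIG(status)
    return exit_code


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_as_init(sys.argv[1:]))
```

Used as the container entrypoint wrapping the build command, the os.wait() loop is the part that matters here: it is what collects a killed Bazel server so that its PID actually goes away.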
I don't know, but with other build systems we don't have such a problem, so it seems to be specific to how bazel works. I had to jump through some hoops to run with --init because I needed to involve the infra guys, but it now works with the provided workaround.
https://docs.docker.com/config/containers/multi-service_container/
The container's main process is responsible for managing all processes that it starts.
In some cases, the main process isn't well-designed, and doesn't handle "reaping"
(stopping) child processes gracefully when the container exits. If your process falls
into this category, you can use the --init option when you run the container.
So it would be nice to have this link in the bazel docker docs, or to fix the issue and handle child processes better.
Description of the problem:
The bazel build or bazel query creates a stale bazel process even after the bazel build/query is completed. This prevents future invocation of other bazel commands.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
On Tekton task list we are following below commands
We are unable to stop these processes, As per this we added a
That didn't shut down any. We got this error:
The bazel clean --expunge also shows the same error.
What operating system are you running Bazel on?
Redhat 7.9 Docker container running in K8S pod (as a Tekton task)
What's the output of bazel info release?
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
release 3.2.0
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
NA
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
Is it required?
Have you found anything relevant by searching the web?
Followed this thread Included the
command, but it didn't stop the existing bazel processes.
Any other information, logs, or outputs that you want to share?
Will share further if required.