bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Bazel clean --expunge or Bazel shutdown unable to kill stale bazel processes #13823

Open uajith opened 3 years ago

uajith commented 3 years ago

Description of the problem:

bazel build or bazel query leaves a stale bazel process behind even after the build/query has completed. This prevents future invocations of other bazel commands.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

In a Tekton task we run the following commands:

  1. bazel query //... (or a list of targets)
  2. Once the query has completed, we still see a bazel process and its child processes running:
    jenkins    2064      1 47 04:49 ?        00:03:07 bazel(directory) -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8 --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib/jli:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib/server:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/ -Dfile.encoding=ISO-8859-1 -jar /home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=120 --output_user_root=/home/jenkins/.cache/bazel/_bazel_jenkins --install_base=/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b --install_md5=ba7765e6f39a679257358196b530585b --output_base=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8 --workspace_directory=/home/jenkins/13518/directory --default_system_javabase=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64 --failure_detail_out=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/failure_detail.rawproto --deep_execroot --expand_configs_in_place --idle_server_tasks --write_command_log --nowatchfs --nofatal_event_bus_exceptions --nowindows_enable_symlinks --client_debug=false --product_name=Bazel --noincompatible_enable_execution_transition 
--option_sources=connect_Utimeout_Usecs:/home/jenkins/13518/directory/.bazelrc:max_Uidle_Usecs:/home/jenkins/13518/directory/.bazelrc
    jenkins   11863   2064 62 04:52 ?        00:02:13 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
    jenkins   11866   2064 51 04:52 ?        00:01:50 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
    jenkins   11877   2064 66 04:52 ?        00:02:21 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
    jenkins   11879   2064 59 04:52 ?        00:02:07 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
    jenkins   16288   1993  0 04:56 ?        00:00:00 grep bazel

    We are unable to stop these processes. As per this, we added a

    bazel shutdown

    That didn't shut down any. We got this error:

    WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
    WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
    WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
    INFO: Waited 60 seconds for server process (pid=2064) to terminate.
    WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
    WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
    INFO: Waited 10 seconds for server process (pid=2064) to terminate.
    FATAL: Attempted to kill stale server process (pid=2064) using SIGKILL, but it did not die in a timely fashion.

    The bazel clean --expunge also shows the same error.

What operating system are you running Bazel on?

Redhat 7.9 Docker container running in K8S pod (as a Tekton task)

What's the output of bazel info release?

Extracting Bazel installation... Starting local Bazel server and connecting to it... release 3.2.0

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

NA

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

Is it required?

Have you found anything relevant by searching the web

Followed this thread and included the

bazel shutdown  

command, but it didn't stop the existing bazel processes.

Any other information, logs, or outputs that you want to share?

Will share further if required.

meisterT commented 3 years ago

Do you have an easy way to reproduce this?

uajith commented 3 years ago

No other way I could see.

clemente0731 commented 3 years ago

I'm hitting the same problem @uajith @meisterT

meisterT commented 3 years ago

@clemente0420 in a different environment?

clemente0731 commented 3 years ago

@clemente0420 in a different environment?

A CentOS Docker container on an Ubuntu 16.04 x86 host. It happens after running bazelisk clean.

clemente0731 commented 3 years ago

@uajith @meisterT Found the problem: check your host for zombie processes.

uajith commented 3 years ago

OK, it works. Looks like the zombie process is causing the problem.

philwo commented 3 years ago

Can you confirm whether you're still seeing real (non-zombie) Bazel processes that you can't get rid of?

If these are zombie processes, I think it would help to run an init process inside your container (with Docker we like to use docker run --init for this purpose).
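To tell zombies apart from live processes, one option on Linux is to read the state field of /proc/<pid>/stat, where Z marks a defunct process. The following is only a sketch under that assumption; the helper names parse_stat_state and list_zombies are hypothetical and not part of Bazel or Docker.

```python
import os


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx])
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def list_zombies(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """Return (pid, comm) for every process currently in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except OSError:
            continue  # process vanished or is unreadable
        if state == "Z":
            zombies.append((pid, comm))
    return zombies
```

Calling list_zombies() inside the container and finding the Bazel server PID in the result would suggest the server has already exited and is only waiting to be reaped.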

jan-kiszka commented 2 years ago

I just ran into the same issue, also inside a container env that we use for all kinds of build processes, including bitbake which has a client-server structure as well. Simply issuing bazel and then bazel shutdown exposes the issue, adding the otherwise unneeded --init to the container env works around it.

What makes only bazel stumble here? Can't this be resolved differently?

larsrc-google commented 2 years ago

What does the server log say during the shutdown (use bazel info | grep server_log before shutdown to find the log)?

jan-kiszka commented 2 years ago

This is with bazel-bootstrap from Debian bullseye: java.log.90be0d0794f5.builder.log.java.20220110-172052.2484.txt

I don't find messages being added when shutdown is invoked, though.

larsrc-google commented 2 years ago

It does indicate that it finishes the shutdown command within a fraction of a second. So it's not that something within Bazel itself is waiting forever during the shutdown command, but something about the state it ends up in.

aminya commented 2 years ago

I am having the same issue. Bazel doesn't exit after build

meisterT commented 2 years ago

If you experience this please run jstack <pid> where <pid> is the PID of the Bazel server.

bcho commented 2 years ago

I am having the same issue in a container environment. I am trying to narrow down the repro steps. I am using the following steps:

  1. run bazel inside the container, with customized output_base
  2. run bazel shutdown
  3. rerun step 1
  4. now I see logs like (ignore the stderr | prefix):
stderr |  WARNING: Running Bazel server needs to be killed, because the startup options are different.
stderr |  WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
stderr |  WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
stderr |  WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
stderr |  INFO: Waited 60 seconds for server process (pid=3292) to terminate.
stderr |  WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
stderr |  WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
stderr |  INFO: Waited 10 seconds for server process (pid=3292) to terminate.
stderr |  FATAL: Attempted to kill stale server process (pid=3292) using SIGKILL, but it did not die in a timely fashion.

The process 3292 is in defunct state and I cannot use jstack to dump the stacktrace:

root      3292  143  0.0      0     0 ?        Zs   16:22   0:44 [java] <defunct>

Any hints for this behavior?

I am using 5.1.1

bcho commented 2 years ago

It looks like my case (and probably any case where bazel runs inside a container) is related to this init process issue: https://github.com/kubernetes/kubernetes/issues/84210

ccontavalli commented 2 years ago

Just hit the same problem in our CI/CD pipeline. The problem was yes, the lack of an init process / child reaper.

What happens:

  1. bazel shutdown, or any bazel command that requires killing/restarting the bazel daemon, will use kill($serverPid) to terminate the server.
  2. In a container, be it k8s or plain docker, if PID 1 is not a process that reaps children (e.g., calls waitpid for any child that dies), the bazel daemon with $serverPid will remain as a zombie once killed. From the OS point of view, the process with $serverPid will keep existing, both as a PID and as a directory in /proc/$serverPid, until a parent waitpids on it.
  3. As per the code in src/main/cpp/blaze_util_posix.cc, the bazel command trying to kill the bazel server keeps sending kill -TERM $serverPid or kill -9 $serverPid until ... the pid goes away from /proc/$serverPid or until kill($serverPid, 0) returns an error (depending on platform).
  4. Given that there is no child reaper, no init process... the zombie sticks around forever, the pid never goes away, and the command trying to kill bazel thinks the process is still running until it eventually times out with the error in this bug.
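The steps above can be reproduced in miniature without Bazel. The sketch below is an illustration, not Bazel's code: a forked child stands in for the server, and the hypothetical helper pid_looks_alive mimics the kill(pid, 0) style of liveness check. It assumes Linux, because it uses os.waitid with WNOWAIT to wait for the child's death without reaping it.

```python
import os
import signal
import time


def pid_looks_alive(pid: int) -> bool:
    """Return True if kill(pid, 0) succeeds, i.e. the PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the PID exists but belongs to another user
    return True


def demo() -> tuple[bool, bool]:
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the server and just waits to be killed.
        time.sleep(60)
        os._exit(0)
    os.kill(pid, signal.SIGKILL)
    # Block until the child is dead, but leave it unreaped (a zombie).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before_reap = pid_looks_alive(pid)  # zombie: the PID is still there
    os.waitpid(pid, 0)                  # reap, as an init process would
    after_reap = pid_looks_alive(pid)   # the PID is gone now
    return before_reap, after_reap
```

Under these assumptions demo() reports the killed child as still present until waitpid runs, which is the situation the waiting bazel client is stuck in when nothing in the container reaps the server.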

Solution/fix: in your container, use an entrypoint that does child reaping. Eg, have PID 1 be /bin/docker-init, /sbin/init, or custom code. Alternatively, run something in the container that does child reaping via PR_SET_CHILD_SUBREAPER, like /bin/docker-init -s.
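As a rough idea of what the "custom code" option could look like, here is a minimal sketch of a wrapper that marks itself as a child subreaper and reaps whatever gets reparented to it while the wrapped command runs. It is an assumption-laden illustration, not a replacement for docker-init or similar tools: the names become_subreaper and run_and_reap are hypothetical, it is Linux-only, it calls prctl through ctypes, and it can only adopt a daemon that is started underneath it.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>, Linux 3.4+


def become_subreaper() -> None:
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                    ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_and_reap(argv: list[str]) -> int:
    """Run argv and reap every child, including adopted orphans, meanwhile."""
    become_subreaper()
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    while True:
        # Blocks until any child dies; orphans adopted via the subreaper
        # flag are collected here while the main command is still running.
        pid, status = os.wait()
        if pid == main_pid:
            break
    while True:
        # Collect anything already dead without waiting for live daemons.
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # remaining children are still running
    return os.waitstatus_to_exitcode(status)
```

A CI step could then call something like run_and_reap(["sh", "-c", "bazel build //... && bazel shutdown"]) so that the server is started, killed, and reaped under the same wrapper. The command string here is only an example.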

uajith commented 1 year ago

Is it possible to run a "sleep" and kill any pending bazel processes ?


JokerDevops commented 11 months ago

So how did we solve this problem? I'm running a CI/CD system using Jenkins with the Kubernetes plugin.

The process of my jenkins agent looks like this

image

I ended up getting the following error.

+ make build-release
bin/bazel clean --expunge
(07:04:55) INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
(07:04:55) INFO: Clean command is running, shutting down worker pool...
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=44) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=44) to terminate.
FATAL: Attempted to kill stale server process (pid=44) using SIGKILL, but it did not die in a timely fashion.
make: *** [build-release] Error 36

nagelp-bosch commented 10 months ago

We've got the same issue while using Bazel in GitHub Actions to build/test a large C++ code base: we sometimes see the "Stop containers" step taking minutes or even >1h / running into a timeout after both bazel build and bazel test have already completed successfully. Usually "Stop containers" takes a few seconds, and this only happens maybe every 100 or 200 runs.

After finding this issue we've added '--init' to the 'docker run' invocation. That seems to fix the problem. However, it's unclear what was happening and why in the first place. Our legacy CMake build never caused the "Stop containers" step to hang.

ccontavalli commented 10 months ago

After finding this issue we've added '--init' to the 'docker run' invocation. That seems to fix the problem. However it's unclear what and why this was happening in the first place. Our legacy CMake build never caused the "Stop containers" step to hang.

See the explanation in https://github.com/bazelbuild/bazel/issues/13823#issuecomment-1247177037 above. TL;DR: bazel shutdown waits for the "pid to disappear", but on any unix system, if there is no "child reaper" (e.g., init, or something doing waitpid on the dead daemon), the pid will keep existing and sticking around (so that the status code, error state, etc. are not lost). From the documentation, it looks like --init in docker starts an init, which does child reaping for zombies.

matejsp commented 4 months ago

We are having the same issue running inside jenkins docker with docker exec on amazon linux 2023. I don't think we can change the jenkins agent configuration or configure docker inside the jenkins pipeline (jenkinsfile) to add --init. Unfortunately the python package tink uses bazel, and with this issue open for the past 3 years it is really hard to build wheels ourselves. It should be possible to run bazel without server/daemon mode, or it should at least be compatible with docker 2024 (out of the box).

      running build_ext
      bazel clean --expunge
      Starting local Bazel server and connecting to it...
      INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
      WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
      WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
      WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
      INFO: Waited 60 seconds for server process (pid=292) to terminate.
      WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
      WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
      INFO: Waited 10 seconds for server process (pid=292) to terminate.
      FATAL: Attempted to kill stale server process (pid=292) using SIGKILL, but it did not die in a timely fashion.
      error: command '/usr/bin/bazel' failed with exit code 36
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tink
  Running setup.py clean for tink
Failed to build tink
ERROR: Failed to build one or more wheels
[Pipeline] }

meisterT commented 4 months ago

I am not sure whether this should be Bazel's business.

Besides docker run --init there are other ways to spawn a lightweight init in the container, e.g. https://github.com/phusion/baseimage-docker/blob/rel-0.9.16/image/bin/my_init or https://github.com/Yelp/dumb-init.

matejsp commented 4 months ago

I don't know, but with other build systems we don't have this problem, so it seems to be specific to how bazel works. I managed to work around it, after jumping through some hoops to run with --init because I needed to involve the infra guys ... and now it works with the provided workaround.

https://docs.docker.com/config/containers/multi-service_container/

The container's main process is responsible for managing all processes that it starts. 
In some cases, the main process isn't well-designed, and doesn't handle "reaping"
(stopping) child processes gracefully when the container exits. If your process falls 
into this category, you can use the --init option when you run the container.

So it would be nice to have this link in the bazel docker docs, or to fix the issue and have better handling of child processes.