Closed frankie-lim-partior closed 2 years ago
You're right, the SIGTERM isn't passed through to child processes. But this is more of a problem with the way you're calling geth. You should be able to call:
exec /usr/local/bin/geth \
--datadir $QUORUM_DATA_DIR ...
and it should pass the signal along. You can take a look at
for inspiration (previously this was suffering from the same problem)
The other thing to be cautious of is piping after the command, as this may affect which process has PID 1.
Whilst using bash might mitigate the problem, the correct solution is to ensure that geth is running as PID 1.
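To illustrate the point about exec, here is a minimal sketch that uses plain sh as a stand-in for geth (an assumption made purely so it runs anywhere). Each case prints the wrapper shell's PID and then the launched command's PID. With exec the two should match, because the command replaces the wrapper instead of running as its child.

```shell
#!/bin/sh
# Each case prints two PIDs: the wrapper shell's, then the launched command's.
# The inner `sh -c "echo $$"` is only a stand-in for geth.

# Without exec: the wrapper stays alive and the command runs as its child.
# (The trailing `true` stops some shells from implicitly exec-ing the last command.)
without_exec=$(sh -c 'echo $$; sh -c "echo \$\$"; true')

# With exec: the command replaces the wrapper and keeps its PID.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

printf 'without exec:\n%s\n' "$without_exec"
printf 'with exec:\n%s\n' "$with_exec"
```

In a container, the wrapper shell is what starts as PID 1, so the exec form is what lets the launched process take over PID 1.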
Closing as no response. Feel free to re-open.
Thanks @antonydenyer. Apologies for the late reply, I was on leave. Will test this.
@antonydenyer I have tried out your suggestion to run geth using exec:
exec /usr/local/bin/geth \
--datadir $QUORUM_DATA_DIR ...
I have then confirmed that within the pod, geth is running with PID 1.
However, when I try to kubectl delete the pod, where k8s sends SIGTERM to the pod, it still didn't trigger geth to shut down gracefully. I can still see the unclean shutdown in the logs.
Anything else you can think of that could solve this?
If the pod is under heavy load you might need to increase the terminationGracePeriodSeconds.
What do you get in the logs after you call delete? You should be getting something like "Got interrupt, shutting down..."
@antonydenyer thanks. I am now able to pass the SIGTERM and have the pod shutdown gracefully as suggested. Thanks for the solution.
System information
Geth version: v22.4.1
OS & Version: Linux, Docker, Kubernetes, GKE
quorumengineering/quorum:22.4.1
Expected behaviour
When a GoQuorum container terminates/exits in Kubernetes, the GoQuorum geth process should shut down gracefully and show logs similar to the below.
Actual behaviour
When a GoQuorum container terminates/exits in Kubernetes, the GoQuorum geth process does not shut down gracefully, because it does not receive the SIGTERM or SIGINT. As a result, once the terminationGracePeriod expires, Kubernetes kills the geth process / container forcefully, resulting in ‘Unclean shutdown detected’ in the GoQuorum log.
Steps to reproduce the behaviour
Possible Root cause
https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html The container shell form runs the command with /bin/sh -cx /usr/local/bin/geth .. , so the process that gets the SIGTERM is actually /bin/sh and not its child geth. The shell shipped by default with Alpine Linux (used in the Quorum Docker image) does not pass signals to children (in this case geth). As a result, when the container / pod is terminated, geth does not receive the graceful termination signal, and at the end of the terminationGracePeriod K8s forcefully kills geth, resulting in an unclean shutdown.
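The effect described above can be reproduced outside Kubernetes with a rough sketch. Everything here is an assumption for illustration: fake_geth.sh is a hypothetical stand-in that only logs a message on SIGTERM, and the wrapper is an ordinary sh process rather than a real container PID 1. The sketch sends SIGTERM to the wrapper's PID in two cases, one where the wrapper launches the stand-in as a child and one where it uses exec.

```shell
#!/bin/sh
# Sketch: does SIGTERM sent to a wrapper shell reach the process it started?
workdir=$(mktemp -d)

# Hypothetical stand-in for geth: logs on SIGTERM, then exits cleanly.
cat > "$workdir/fake_geth.sh" <<'EOF'
trap 'echo "Got interrupt, shutting down..." >> "$1"; exit 0' TERM
echo $$ > "$1.pid"    # written after the trap is installed
i=0
while [ "$i" -lt 10 ]; do sleep 1; i=$((i + 1)); done
EOF

run_case() {
  # $1 = wrapper command string, $2 = log file for the stand-in
  sh -c "$1" "$workdir/fake_geth.sh" "$2" &
  wrapper=$!
  # Wait (up to ~10s) until the stand-in is up and has written its PID.
  tries=0
  while [ ! -s "$2.pid" ] && [ "$tries" -lt 10 ]; do
    sleep 1
    tries=$((tries + 1))
  done
  kill -TERM "$wrapper"               # signal only the wrapper's PID
  wait "$wrapper" 2>/dev/null || true
  # If the stand-in never logged, it is still running: clean it up.
  [ -s "$2" ] || kill -KILL "$(cat "$2.pid")" 2>/dev/null
}

# Case 1: wrapper runs the stand-in as a child (trailing `true` avoids implicit exec).
run_case 'sh "$0" "$1"; true' "$workdir/noexec.log"
# Case 2: wrapper is replaced by the stand-in via exec.
run_case 'exec sh "$0" "$1"' "$workdir/exec.log"

for name in noexec exec; do
  if [ -s "$workdir/$name.log" ]; then
    echo "$name: stand-in logged a graceful shutdown"
  else
    echo "$name: stand-in never saw SIGTERM"
  fi
done
```

Only the exec case should produce the shutdown message, since that is the only case where the signalled PID belongs to the stand-in itself.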
Suggestion
The Bash shell is able to pass the termination signal to the child process. We suggest including Bash in the Quorum Docker image, so that we can start geth in the container using bash rather than sh.
Alternative
We have tested using a k8s lifecycle preStop hook to send a kill to geth; however, it does not seem to be able to execute the preStop hook to kill geth.
Executing the kill command directly in a pod exec does work, and geth shuts down gracefully.
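For reference, the manual kill can be wrapped in a small helper that mirrors the sequence described in this issue: send SIGTERM, wait for the process to exit, and only force-kill after a timeout. This is a sketch only. graceful_stop is a hypothetical helper name and the timeout value is arbitrary.

```shell
#!/bin/sh
# graceful_stop PID TIMEOUT_SECONDS  (hypothetical helper, not part of geth or k8s)
# Sends SIGTERM, polls until the process is gone, and falls back to SIGKILL
# once the timeout is reached. Returns 0 on a graceful exit, 1 if it had to force-kill.
graceful_stop() {
  pid=$1
  timeout=$2
  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
  waited=0
  while kill -0 "$pid" 2>/dev/null; do        # kill -0 only checks for existence
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Inside a pod exec this might be called as `graceful_stop "$(pidof geth)" 30`, assuming pidof is available in the image.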
Backtrace