berops / claudie

Cloud-agnostic managed Kubernetes
https://docs.claudie.io/
Apache License 2.0
601 stars 40 forks source link

Chore: Verify SIGTERM handling behavior #1291

Closed bernardhalas closed 7 months ago

bernardhalas commented 7 months ago

Description

By default, SIGTERM is an important signal for the processes which represents a request for graceful shutdown. It's used by Kubernetes whenever there's a pod termination requested (e.g. pod eviction, pod restarts upon liveness probe failure, etc.).

We need to ensure that Claudie services react properly to SIGTERM signal on all levels. This means:

  1. Ensure that all Docker images propagate SIGTERM to the running processes (namely PID 0 within container). Tini can be used for that purpose.
  2. Services themselves respect SIGTERM.
  3. terraformer, upon receiving SIGTERM, should issue exactly one SIGTERM to the terraform process (because multiple SIGTERM or SIGINT signals to terraform process represent a request for a non-graceful shutdown; ref <<-- I couldn't find the src code, but at least referring to the issues that confirm this behavior)

Exit criteria

bernardhalas commented 7 months ago

As agreed, during today's grooming, passing directly to Jakub.

JKBGIT1 commented 7 months ago

Ensure all Claudie images forward SIGTERM do processes

All Claudie images forward SIGTERM to main the process of the container, but not to the child processes of this process.

Services respect SIGTERM (if not, a new Issue can be opened to services which don't respect SIGTERM)

I would say that services respect SIGTERM partially. Since the workflow heavy lifting is performed by the child processes of the Claudie services, the Claudie services have to wait until the child processes finish (No SIGTERM is forwarded to the child processes), and then the Claudie processes terminate.

terraformer, upon receiving SIGTERM, should issue exactly one SIGTERM to the terraform process (because multiple SIGTERM or SIGINT signals to terraform process represent a request for a non-graceful shutdown;

SIGTERM isn't forwarded to the terraform process, because it is a child process of the terraformer main container process.

Also, if you SIGTERM terraform process it is shutdown gracefully, but the terraformer re-run the terraform process as part of the retry loop. In my case it broke the TF state (error regarding already existing resource) and the workflow failed.

JKBGIT1 commented 7 months ago

I'm closing this issue in favor of https://github.com/berops/claudie/issues/1298

We had a discussion about the current way of handling the SIGTERM calls. To summarize, Claudie's services listen to SIGTERM calls. The SIGTERM is forwarded to the main process of the container, then the call is handled programmatically by us. Besides that, we don't want to forward this call to the child processes of the main container process.

EDIT: @bernardhalas please complement me if there is something I missed or worth mentioning.