cetic / helm-nifi

Helm Chart for Apache Nifi

[cetic/nifi] pod stops only after terminationGracePeriodSeconds exceeded #239

Closed · gforeman02 closed this issue 2 years ago

gforeman02 commented 2 years ago

Describe the bug: When a pod in a NiFi cluster is triggered to stop, the node offload process can take some time to complete. Increasing terminationGracePeriodSeconds to 600 gives the server container enough time to finish offloading, with a generous buffer (in my environment). The NiFi server container does stop, but the sidecar containers continue to run, so the pod does not terminate until the full terminationGracePeriodSeconds has elapsed.

Version of Helm, Kubernetes and the NiFi chart: Helm 3.5.1, Kubernetes 1.20.5, NiFi chart 1.0.5

What happened: See description.

What you expected to happen: The pod's termination should complete as soon as the NiFi server container stops; terminationGracePeriodSeconds should only be a last-resort deadline for killing the pod.

How to reproduce it (as minimally and precisely as possible):

1. Deploy a NiFi cluster with terminationGracePeriodSeconds set to 600.
2. Reduce the number of nodes in the cluster and note the time.
3. Monitor the NiFi server container log to confirm the offload process completes.
4. Monitor how long it takes for the pod to terminate.
5. Redeploy the cluster with the sidecars commented out and scale a node down again: this time the pod terminates as soon as the NiFi server container stops, instead of waiting out the full terminationGracePeriodSeconds. A command sketch for these steps follows below.
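For concreteness, here is a minimal sketch of those steps. The release name (nifi), pod name (nifi-2), container name (server), and the replicaCount / terminationGracePeriodSeconds value keys are assumptions about this chart's values, not confirmed by this issue:

```sh
# All names and value keys below are illustrative assumptions.
helm repo add cetic https://cetic.github.io/helm-charts
helm upgrade --install nifi cetic/nifi \
  --set replicaCount=3 \
  --set terminationGracePeriodSeconds=600

# Scale one node away and note the time.
helm upgrade nifi cetic/nifi --reuse-values --set replicaCount=2

# Watch the server container finish offloading...
kubectl logs nifi-2 -c server -f

# ...then time how long the pod stays in Terminating.
kubectl get pod nifi-2 -w
```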

Anything else we need to know: None

weltonrodrigo commented 2 years ago

I believe this is a duplicate of #158.

This is caused by the log-tailing sidecars, which for some reason refuse to quit on SIGTERM. I investigated but could not find a quick solution.

Maybe we need software other than tail, something that respects SIGTERM. tail seems to ignore it when running in -F mode.
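One likely mechanism: tail runs as PID 1 in the sidecar container, and PID 1 gets no default SIGTERM disposition, so the signal is dropped rather than actively ignored by tail itself. A minimal sketch of a wrapper that forwards the signal, assuming the sidecar currently runs something like tail -n+1 -F on a NiFi log file (the path and flags here are illustrative, not the chart's actual command):

```sh
#!/bin/sh
# Run tail in the background and forward SIGTERM to it from the shell,
# so the sidecar exits promptly on pod stop instead of waiting out
# terminationGracePeriodSeconds.
tail -n +1 -F /var/log/nifi-app.log &   # log path is an assumption
pid=$!
trap 'kill "$pid"' TERM INT
wait "$pid"
```

Running the sidecar under a minimal init such as tini would achieve the same thing, since tini installs signal handlers and forwards signals to its child.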

gforeman02 commented 2 years ago

@weltonrodrigo agreed, these appear to be the same issue.

I just submitted a PR that resolves the issue in my local environment. I tested with both a single instance and a 3-node cluster, and in both cases the pod was removed well before the grace period expired.