Closed: maciejmochol closed this issue 1 year ago
@maciejmochol The actual error is not related to the FailedPreStopHook event. That is simply the pre-stop hook you enabled with ...gracefullTermination waiting 600 seconds for jobs to finish before killing the pod (which, in this case, was not enough time for whatever tasks were running to finish).
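For later readers, a minimal values.yaml sketch of the setting being discussed (key names assumed per the chart's workers.celery section - verify against your chart version's values.yaml):

```yaml
workers:
  celery:
    # run a pre-stop hook that waits for in-flight Celery tasks to finish
    gracefullTermination: true
    # how long (in seconds) the pre-stop hook waits before the pod is killed anyway
    gracefullTerminationPeriod: 600
```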
The liveness probe itself is failing with the error OCI runtime exec failed: exec failed: unable to start container process: error executing setns process: exit status 1: unknown, which looks like something is wrong with that Kubernetes Node in your cluster, preventing the kubelet from executing commands in the container.
Questions:
Are you able to use kubectl exec -it ... to manually run commands in Pods? (Especially check with Pods on the Nodes that were running these workers.)
I would try:
Hello Mathew,
Thanks for your prompt answer. Yes, we are setting this on all nodes (we use the Autopilot mode of GCP), and yes, I was able to get a shell inside the worker pod(s).
I don't think our Nodes are over-provisioned, as I assume that should not happen in GCP Autopilot. But it seems you may be right on point (2) - after we disabled the liveness probe, the pod was getting killed by this OOMKiller beast (aka exit code 137). So we increased the RAM for the pod, and the workers are not crashing anymore. After some time we will re-enable the liveness probe to see if it all works.
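A hedged sketch of the two changes described above, expressed as chart values (the 4Gi figure is purely illustrative, and the key names should be checked against chart version 8.8.0's values.yaml):

```yaml
workers:
  # temporarily disable the worker liveness probe while debugging the OOM kills
  livenessProbe:
    enabled: false
  # give each worker pod more memory headroom so it is not OOMKilled (exit code 137)
  resources:
    requests:
      memory: "4Gi"
    limits:
      memory: "4Gi"
```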
But this OOM is actually worrying me. I had previously reduced AIRFLOW__CELERY__WORKER_CONCURRENCY down to 5 (from 16, which I think was the default?), and it worked fine until now, when all of a sudden the workers seem to have started consuming more memory than usual. Is there a way to prevent the Python processes from consuming too much memory?
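For illustration, the concurrency reduction mentioned above can be set through Airflow's standard environment-variable config, which this chart exposes under airflow.config (a sketch, assuming that key; fewer task slots per worker generally means a lower peak memory footprint per pod):

```yaml
airflow:
  config:
    # number of Celery task processes per worker; lower values reduce
    # how much memory a single worker pod can consume at peak
    AIRFLOW__CELERY__WORKER_CONCURRENCY: "5"
```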
That said, I believe this is outside the scope of the bug I raised earlier - thank you for the help.
Checks
User-Community Airflow Helm Chart
Chart Version
8.8.0
Kubernetes Version
Helm Version
Description
Dear Community,
Our Airflow started crashing - it seems GCP kills it due to a failed liveness probe. The 'kubectl describe pod' output shows an error related to this code in the Helm chart:
https://github.com/airflow-helm/charts/blob/c4ade4e5820cd9f3e8d26d6640da04fdff33b9c6/charts/airflow/templates/worker/worker-statefulset.yaml#L142
Partial output of 'kubectl describe pod airflow-worker-0':
Relevant Logs
No response
Custom Helm Values
No response