Open IKholopov opened 10 months ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
@IKholopov Thanks for reporting the issue. It looks like a bug and requires further investigation.
Hello team! I am investigating this issue now and will try to prepare a fix for it.
Apache Airflow Provider(s)
google
Versions of Apache Airflow Providers
apache-airflow-providers-google==10.12.0
Apache Airflow version
2.6.3
Operating System
Ubuntu 20.04.6
Deployment
Other
Deployment details
N/A
What happened
In a DAG with ~500 GKEStartPodOperator tasks (running pods on another cluster, hosted on GKE) we discovered that operator execution hangs while polling logs in ~0.2% of the task instances. Based on the logs, execution halts in a call inside the kubernetes client (read_namespaced_pod_log, to be exact).
Only after the DAG run timeout (hours later), when SIGTERM is dispatched to the task run process, does execution resume. It then retries fetching the logs and the pod status, but by that point both have already been garbage collected.
This looks exactly like https://github.com/kubernetes-client/python/issues/1234#issuecomment-695801558.
After running the same deployment in deferred mode, 1 task also ended up locked in a similar way, this time in another call (for creation):
I believe this is specific to GKEStartPodOperator: KubernetesHook has a mechanism that ensures TCP keepalive is configured by default (https://github.com/apache/airflow/blob/1d5d5022b8fc92f23f9fdc3b61269e5c7acfaf39/airflow/providers/cncf/kubernetes/hooks/kubernetes.py#L216), but GKEPodHook does not (https://github.com/apache/airflow/blob/1d5d5022b8fc92f23f9fdc3b61269e5c7acfaf39/airflow/providers/google/cloud/hooks/kubernetes_engine.py#L390).
What you think should happen instead
GKEPodHook should reuse the same socket configuration used in KubernetesHook and configure TCP Keepalive (unless disabled).
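For illustration, here is a minimal, stdlib-only sketch of the kind of socket configuration meant here. It is not the actual KubernetesHook code: the helper names build_keepalive_options and apply_keepalive are hypothetical, and the idle/interval/count values are placeholder assumptions. With a urllib3-based client such as the kubernetes Python client, options like these would typically be appended to urllib3's HTTPConnection.default_socket_options before the client is created.

```python
import socket


def build_keepalive_options(idle=120, interval=30, count=6):
    """Return (level, option, value) tuples that enable TCP keepalive.

    The defaults are illustrative assumptions, not values taken from the issue.
    """
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    # The fine-grained knobs are platform specific, so only add those that exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first keepalive probe is sent.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle))
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between consecutive probes.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval))
    if hasattr(socket, "TCP_KEEPCNT"):
        # Number of unanswered probes before the connection is dropped.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count))
    return options


def apply_keepalive(sock, options):
    """Apply the given socket options to an existing socket."""
    for level, name, value in options:
        sock.setsockopt(level, name, value)


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        apply_keepalive(sock, build_keepalive_options())
        enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
        print("keepalive enabled:", enabled)
    finally:
        sock.close()
```

With keepalive enabled, a connection whose peer disappears silently (for example, when a spot VM is reclaimed) is eventually detected by the kernel and the blocked read fails, rather than waiting indefinitely.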
How to reproduce
Run ~500 tasks on GKE with spot VMs. There is no reliable repro, but the problem has been clearly documented before and was fixed for the CNCF Kubernetes provider: https://github.com/apache/airflow/pull/11406.
Anything else
No response