Open aptenodytes-forsteri opened 6 months ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
Thanks for sharing this @aptenodytes-forsteri. I was asking on the Google issue tracker how to do this, in order to simulate what is available on a GCP Composer cluster, e.g. /home/airflow/gcs/data. Someone on that issue mentioned they got it to work, and then I did too. As I was looking into whether I should post a PR/issue here, I stumbled upon this issue.
It's not the exact same but I imagine more and more folks want to do this and will want to mount multiple buckets/volumes.
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
No response
Apache Airflow version
airflow-2.7.3
Operating System
linux
Deployment
Google Cloud Composer
Deployment details
No response
What happened
I created a DAG with a KubernetesPodOperator that uses annotations to mount GCS buckets via the Google Cloud Storage FUSE Container Storage Interface (CSI) plugin.
Due to what I believe to be another bug, https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/257, the GCS FUSE sidecar would sometimes stay running.
From log messages, I determined the operator was stuck in an infinite loop here: https://github.com/apache/airflow/blob/029cbaec174b73370e7c4ef2d7ec76e7be333400/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L623
This appears to be a somewhat known issue, as the above function seems to have a special case for the Istio sidecar. The same treatment should apply to all sidecars.
What you think should happen instead
It should not be possible for a "rogue" sidecar container to cause the KubernetesPodOperator to end up in an infinite loop. This behavior ends up hogging resources on a cluster and eventually clogs up the whole cluster with zombie pods.
One possible fix would be an optional timeout for how long to wait for the pod to do its work.
Another possible fix would be to generalize the Istio sidecar treatment to all types of sidecars.
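To make the two proposals concrete, here is a rough, standalone sketch of a wait loop that combines them. It is not the provider's actual code: `PodSnapshot`, `await_completion`, and the `read_pod` callback are hypothetical names invented for illustration, and the assumption that the main container is called `base` is only a default. The loop returns as soon as the pod reaches a terminal phase or the main container has terminated (regardless of what any sidecar is doing), and raises if an optional timeout expires first.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Pod phases after which no container will run again.
TERMINAL_PHASES = {"Succeeded", "Failed"}


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of a pod at one point in time."""

    phase: str
    # Maps container name -> True if that container has terminated.
    terminated: Dict[str, bool]


def await_completion(
    read_pod: Callable[[], PodSnapshot],
    base_container: str = "base",  # assumed name of the main container
    timeout: Optional[float] = None,  # seconds; None means wait forever
    poll_interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> PodSnapshot:
    """Wait for the pod's real work to finish, ignoring lingering sidecars."""
    deadline = None if timeout is None else clock() + timeout
    while True:
        pod = read_pod()
        # Normal case: the whole pod has finished.
        if pod.phase in TERMINAL_PHASES:
            return pod
        # Generalized sidecar handling: the main container is done, so stop
        # waiting even if a sidecar keeps the pod in the Running phase.
        if pod.terminated.get(base_container, False):
            return pod
        # Optional upper bound so a stuck pod cannot block forever.
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(
                f"container {base_container!r} did not finish within {timeout} seconds"
            )
        sleep(poll_interval)
```

With something along these lines, a sidecar that never exits would no longer keep the operator polling indefinitely; the caller would still need to delete the pod afterwards so the sidecar does not linger on the cluster.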
How to reproduce
The container runs a simple Python script that writes 100 files to /out.
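The original script is not shown here, so the following is only a guess at what such a script could look like. The function name `write_files`, the file naming scheme, and the file contents are all assumptions made for illustration; only the count of 100 and the `/out` target come from the description above.

```python
from pathlib import Path
from typing import List


def write_files(out_dir: str = "/out", count: int = 100) -> List[Path]:
    """Write `count` small text files into `out_dir` and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        # Hypothetical naming scheme: file_000.txt ... file_099.txt
        path = out / f"file_{i:03d}.txt"
        path.write_text(f"file {i}\n")
        paths.append(path)
    return paths

# Inside the container this would be invoked as: write_files("/out")
```

Pointing `out_dir` at a local temporary directory makes it possible to try the script without the GCS FUSE mount.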
Anything else
I can work around the issue with:
Are you willing to submit PR?
Code of Conduct