apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

KPO pod completes before pod_manager.py is able to finish streaming pod logs #42923

Open andrew-stein-sp opened 5 days ago

andrew-stein-sp commented 5 days ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.1

What happened?

We have a particular task that spawns a pod that is scheduled by itself onto a GPU node. What I've observed is that, occasionally, the k8s pod reaches a status of "Completed" before pod_manager.py finishes streaming logs from said pod to Airflow's logging system. Yes, this task produces A LOT of logs.

Sometimes it can take 2 minutes or more for pod_manager.py to actually catch up. The problem happens when AWS Karpenter reclaims the GPU node before Airflow finishes streaming the pod logs, resulting in a 404 error from the Kubernetes API like the one below, and the Airflow task being marked failed.

kubernetes.client.exceptions.ApiException: (404)

We're already using the karpenter.sh/do-not-disrupt:true annotation, but that isn't effective at preventing node reclamation once the pod reaches a state of "Completed".

We've been able to work around this for now by setting get_logs=False for that particular task; however, we shouldn't have to do that.
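For reference, here's roughly what that task definition looks like, with both the Karpenter annotation and the get_logs workaround (the task id and image are placeholders, not our real ones):

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

gpu_task = KubernetesPodOperator(
    task_id="gpu_task",  # hypothetical task id
    name="gpu-task",
    image="my-registry/gpu-job:latest",  # placeholder image
    # Ask Karpenter not to disrupt the node while the pod is running;
    # this stops helping once the pod reaches "Completed".
    annotations={"karpenter.sh/do-not-disrupt": "true"},
    # Workaround: skip log streaming entirely for this task.
    get_logs=False,
)
```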

The other possibility would be to increase the amount of time before Karpenter can reclaim a node, but again, these are GPU nodes, so they're expensive, and we run 300k tasks a day in just 1 of the 8 regions we have Airflow deployed to.

What you think should happen instead?

pod_manager.py should be able to set the Airflow task to "success" once the pod reaches a "Completed" state, and then continue to stream logs to Airflow on a "best effort" basis. In other words, if a kube API error is received while getting pod logs AFTER the pod has reached a "Completed" state, that error should be ignored.
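To illustrate what I mean, here's a rough sketch of those semantics as a hypothetical wrapper around the provider's PodManager (not an actual patch to pod_manager.py; names other than fetch_container_logs and read_pod are made up):

```python
from kubernetes.client.exceptions import ApiException

POD_FINISHED_PHASES = ("Succeeded", "Failed")

def fetch_logs_best_effort(pod_manager, pod, container_name: str) -> None:
    """Hypothetical helper: stream logs, but swallow kube API errors
    that occur after the pod has already finished."""
    try:
        pod_manager.fetch_container_logs(pod=pod, container_name=container_name, follow=True)
    except ApiException as e:
        try:
            phase = pod_manager.read_pod(pod).status.phase
        except ApiException:
            phase = None  # Pod (or the node it ran on) is already gone.
        if e.status == 404 or phase in POD_FINISHED_PHASES:
            # The pod completed; losing the tail of the logs
            # shouldn't fail the Airflow task.
            pod_manager.log.warning(
                "Could not finish streaming logs for completed pod: %s", e
            )
        else:
            raise
```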

Perhaps there is a way to improve log stream performance? If so, I'd love to know how.

How to reproduce

Create a pod that produces a lot of logs, then kill the EKS node within a minute or two of the pod reaching a "Completed" state.
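Something along these lines should flood the log stream enough to reproduce it (the busybox image and line count are arbitrary):

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical reproducer: a pod that prints a huge number of log lines,
# so pod_manager.py is still streaming well after the pod completes.
noisy_pod = KubernetesPodOperator(
    task_id="noisy_pod",
    name="noisy-pod",
    image="busybox",
    cmds=["sh", "-c"],
    arguments=["i=0; while [ $i -lt 1000000 ]; do echo \"log line $i\"; i=$((i+1)); done"],
    get_logs=True,
)
```

Then delete the node the pod was scheduled on shortly after the container exits.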

Operating System

Official Airflow Image on python3.10 (debian)

Versions of Apache Airflow Providers

we don't have any providers pinned, so whatever versions ship with 2.10.1

Deployment

Official Apache Airflow Helm Chart

Deployment details

Deployed via ArgoCD.

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 5 days ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

potiuk commented 2 days ago

Can you please try upgrading the provider to https://pypi.org/project/apache-airflow-providers-cncf-kubernetes/9.0.0rc1/ - there were some fixes to log streaming implemented in the latest version, which is about to be released.
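You can pull the release candidate into a test image with something like:

```
pip install "apache-airflow-providers-cncf-kubernetes==9.0.0rc1"
```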

andrew-stein-sp commented 1 day ago

I'll get an image built in our dev environment to test and let you know.