GoogleCloudPlatform / airflow-operator

Kubernetes custom controller and CRDs for managing Airflow
Apache License 2.0

Advice - handling deployments whilst a DAG is running #13

Open darrenhaken opened 6 years ago

darrenhaken commented 6 years ago

First of all, I am thrilled you're working on this operator! Also, great work on Composer.

I was wondering if anyone would be willing to discuss how to achieve DAG reliability while an Airflow component is being deployed. Since Kubernetes can routinely reschedule pods, I imagine this demands greater reliability from DAGs.

When using Airflow to, say, run a Spark job on Dataproc, what would happen to a DAG run if a restart occurred? Do you have any advice on improving reliability?

Please feel free to reply, or to talk offline if that's more useful. Hopefully you can provide some input.

barney-s commented 6 years ago

By DAG reliability, do you mean what happens if pods (mapped to Airflow tasks) are restarted often? My understanding is that Airflow tasks are meant to be designed to be idempotent (no side effects). That would take care of unreliable pod scheduling even with the Celery/worker setup.
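Roughly what I mean by idempotent, as a sketch (the DAG, paths, and helper logic below are made up for illustration, not part of this operator): the task keys everything off the logical date and overwrites a fixed output, so a rescheduled pod that reruns it converges to the same end state instead of duplicating work.

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def load_partition(ds: str) -> str:
    # Everything is keyed off the logical date: the same DAG run always
    # targets the same output path, so a rescheduled pod re-running this
    # task converges to the same state rather than appending duplicates.
    target = f"/tmp/daily/{ds}/result.csv"  # hypothetical output location

    if os.path.exists(target):  # check before writing: a rerun is a no-op
        return target

    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as f:  # overwrite in place, never append
        f.write(f"rows for {ds}\n")
    return target


with DAG(
    dag_id="idempotent_example",       # hypothetical DAG
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_partition",
        python_callable=load_partition,
        op_kwargs={"ds": "{{ ds }}"},  # templated logical date
    )
```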

But for the cases you mentioned, I would like to loop in a few more folks to advise. @liyinan926 @dimberman

darrenhaken commented 6 years ago

@barney-s thanks for the answer. I wasn't sure whether the entire DAG failed or simply the task; it looks like it's just the task.

I'm interested in hearing any other info you have.

darrenhaken commented 6 years ago

Hi, I quickly want to pick this back up to get a bit more detail from you.

So let's say I have a task running during a DAG run, for example one that runs a remote Spark job. Would you expect the task to fail, since a deployment would cause the Pod to be replaced with a new instance?

Would the task then retry and the Spark job be resubmitted?

dimberman commented 6 years ago

@darrenhaken if the pod running the Spark job fails and you're using the k8s executor, it will report as a failure of the task. It would be difficult to have a pod come back up and somehow recreate task-level state, since we don't have access to that task-level information.
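For what it's worth, the usual way to get the resubmission you're describing is plain Airflow retries: the failed task instance is re-queued by the scheduler and the submission runs again. A rough sketch below; the DAG, operator, and values are made up for illustration, not anything specific to this operator, and the bash command is just a placeholder for a real Spark/Dataproc submit step.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # stand-in for a real submit operator

default_args = {
    # If the worker pod is replaced mid-task, the task instance is marked
    # failed; these settings make the scheduler re-queue it automatically.
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
}

with DAG(
    dag_id="spark_submit_with_retries",  # hypothetical DAG
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Placeholder for the real submission; because retries resubmit the same
    # logical run, the submitted job should itself be idempotently named.
    BashOperator(
        task_id="submit_spark_job",
        bash_command="echo 'submit spark job for {{ ds }}'",
    )
```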

gegnew commented 4 years ago

Hi, sorry to necropost a bit here, but I recently had an issue where a redeployment of our Airflow service caused a DAG to hang. A redeployment (via Terraform on ECS) occurred during a DAG run, and for some reason that DAG run was never marked as "failed" but was left marked as "running" after the redeployment, even though nothing was, in fact, running. Since our DAG was configured to disallow parallel runs, this stopped any further DAG runs from starting.

Any thoughts about how to prevent this?
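For context, our setup is roughly the sketch below (names and timeouts are made up). My working assumption is that bounding the run with `dagrun_timeout` (and the tasks with `execution_timeout`) would at least stop an orphaned "running" run from blocking new ones, but I'd appreciate confirmation.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bounded_run_example",        # hypothetical DAG
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_runs=1,                   # no parallel runs, as described above
    dagrun_timeout=timedelta(hours=2),   # a run stuck "running" is failed after 2h
) as dag:
    BashOperator(
        task_id="do_work",
        bash_command="echo work",
        execution_timeout=timedelta(hours=1),  # bound the task itself as well
    )
```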