apache / airflow


Add job_id to DagRun table and remove BackfillJob zombies #11302

Open turbaszek opened 3 years ago

turbaszek commented 3 years ago

Description

Currently, we have run_type in the DagRun table, but there's no way to determine which job created a DagRun (there is no 1-1 relation between DagRun and Job). Having this relation can be helpful in debugging (especially, I think, in the case of Scheduler HA and BackfillJobs), as it will also allow users to check which scheduler / backfill job triggered their task (currently the job_id in TaskInstance is always the id of the LocalTaskJob).

Introducing job_id may also help us make backfill runnable remotely, as we will be able to clean up after jobs that failed, thus reducing possible zombies.
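For concreteness, the change could be a single nullable column on the dag_run table. A minimal sketch of the migration, assuming Alembic as Airflow uses for schema changes - illustrative only, not Airflow's actual migration, and the column name follows this proposal (an alternative name is discussed below):

```python
# Illustrative Alembic migration for the proposed column; revision details
# are omitted and this is not Airflow's real migration. The column is
# nullable because manually triggered runs (CLI / webserver) have no
# creating job.
from alembic import op
import sqlalchemy as sa


def upgrade():
    op.add_column("dag_run", sa.Column("job_id", sa.Integer(), nullable=True))


def downgrade():
    op.drop_column("dag_run", "job_id")
```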

The problem

Run:

airflow dags backfill -v -s 2020-10-06 example_bash_operator
# once there's a process running a single task, do the following:
pkill -9 -f backfill

This will result in a "zombie" DagRun and a related task instance that will not be cleaned up by the scheduler (at least that's my understanding). Example:

[Screenshot: the zombie DagRun and its task instance]

However, querying the job table we see:

[Screenshot: the job table still showing the backfill job as running]

So, according to Airflow's state the backfill job is still running, but that's not true, as we killed the job 👎
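As a reference point, here is a minimal sketch of how such dead-but-"running" backfill jobs could be detected from the job table's heartbeat. Module paths and column names assume Airflow around the 2.0 timeframe, and the grace period is an arbitrary example value:

```python
# Illustrative sketch: find BackfillJobs that claim to be running but have
# stopped heartbeating. Not actual Airflow code.
from datetime import timedelta

from airflow.jobs.base_job import BaseJob
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State


def find_dead_backfill_jobs(grace=timedelta(minutes=5)):
    """Return BackfillJobs whose last heartbeat is older than `grace`."""
    limit = timezone.utcnow() - grace
    with create_session() as session:
        return (
            session.query(BaseJob)
            .filter(BaseJob.job_type == "BackfillJob")
            .filter(BaseJob.state == State.RUNNING)
            .filter(BaseJob.latest_heartbeat < limit)
            .all()
        )
```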

Possible solution

Link a specific job to the DagRun it triggered (using the job_id) and then run a process that will kill the zombies.

Cleaning of such zombies could then easily be triggered by the scheduler, as in the sketch below.
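To make the idea concrete, a hypothetical cleanup could look like this. It assumes DagRun has gained the proposed job_id column (which does not exist at the time of this issue) and reuses find_dead_backfill_jobs from the earlier sketch:

```python
# Hypothetical cleanup: fail zombie DagRuns (and their unfinished task
# instances) whose creating backfill job has died. DagRun.job_id is the
# proposed column, not an existing one.
from airflow.models import DagRun
from airflow.utils.session import create_session
from airflow.utils.state import State


def fail_runs_of_dead_jobs(dead_job_ids):
    with create_session() as session:
        runs = (
            session.query(DagRun)
            .filter(DagRun.job_id.in_(dead_job_ids))  # proposed column
            .filter(DagRun.state == State.RUNNING)
            .all()
        )
        for run in runs:
            for ti in run.get_task_instances(session=session):
                if ti.state not in (State.SUCCESS, State.FAILED):
                    ti.state = State.FAILED  # committed when the session exits
            run.state = State.FAILED
```

The scheduler loop could then periodically call something like fail_runs_of_dead_jobs([j.id for j in find_dead_backfill_jobs()]).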

I think this may bring us closer to triggering backfill via API / UI.

Use case / motivation

Introduce the relation between the DagRun and Job tables and implement a process that will clean up zombies created by BackfillJobs.

Related Issues

https://github.com/apache/airflow/pull/8227

Thanks to our friends from Databand for hinting at this!

turbaszek commented 3 years ago

@ashb @mik-laj I'm happy to hear what you think

ashb commented 3 years ago

Do you have a case where it's interesting to know what job created the DagRun? (I ask because I can't think of one immediately)

Don't forget that in the case of trigger (via CLI or webserver) there is no job id to use.

I would probably suggest naming it created_by_job_id - it's a bit clearer what it stores than just job_id

turbaszek commented 3 years ago

> Do you have a case where it's interesting to know what job created the DagRun? (I ask because I can't think of one immediately)

I have an operator that triggers a backfill job and observes that process. If the "parent" task dies (SIGKILL), then there are zombie tasks (scheduled/none state) from the backfill job that are not cleaned up by anything. But that's probably an edge case, plus we use a custom implementation of BackfillJob (that's why I was able to fix it by using DagRun.conf for storing the job_id - a hack).
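For illustration, that kind of DagRun.conf workaround might look like the sketch below; the helper name and conf key are hypothetical, not taken from our custom implementation:

```python
# Sketch of the workaround: stash the creating job's id in the run conf when
# creating the backfill run, since there is no dedicated column for it.
from airflow.utils.state import State


def create_backfill_run(dag, job, execution_date):
    return dag.create_dagrun(
        run_id=f"backfill__{execution_date.isoformat()}",
        execution_date=execution_date,
        state=State.RUNNING,
        external_trigger=False,
        conf={"creating_job_id": job.id},  # the hack: conf instead of a column
    )


# later, when deciding whether a run is orphaned:
# creating_job_id = (dag_run.conf or {}).get("creating_job_id")
```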

However, I think this information can be helpful in the case of multiple schedulers, for example, to find that only one of them has some problems (no idea what problems).

So, basically my suggestion is about adding non-crucial information that may sometimes help 😄

> I would probably suggest naming it created_by_job_id - it's a bit clearer what it stores than just job_id

+1 to this

potiuk commented 3 years ago

After a discussion with @turbaszek - I think it's super useful to have this information, as it allows us to investigate the reasons for problems, and we can add more information - like which scheduler created a DagRun. I would love to get it added even for 2.0, after merging the HA change.

turbaszek commented 3 years ago

Additionally, this may help us with making backfill runnable remotely (see "The problem" and "Possible solution" in the issue description above).

WDYT? @ashb @kaxil @potiuk @mik-laj @dimberman

ashb commented 3 years ago

Got it, that makes sense.

There's code in the scheduler HA branch that will kill timed-out / not-heartbeating SchedulerJobs - that could be extended to BackfillJob too.

I'm not sure what change would be needed to detect zombie tasks from backfill jobs, but I agree it's a good goal.

turbaszek commented 3 years ago

I will try to tackle it once #10956 is merged

kaxil commented 3 years ago

+1 to this change, thanks @turbaszek

mik-laj commented 3 years ago

+1 from my side. This can be used to tune the scheduler better.

ashb commented 3 years ago

@turbaszek Did we do all of this or just the "job id" part?

turbaszek commented 3 years ago

> @turbaszek Did we do all of this or just the "job id" part?

Just the job id. I'm going to propose an AIP in a few days for redesigning backfill.