turbaszek opened this issue 4 years ago
@ashb @mik-laj I'm happy to hear what you think
Do you have a case where it's interesting to know what job created the DagRun? (I ask because I can't think of one immediately)
Don't forget that in the case of trigger (via CLI or webserver) there is no job id to use.
I would probably suggest naming it created_by_job_id - it's a bit clearer what it stores than just job_id
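For illustration only - a self-contained toy sketch (not Airflow's actual models) of what a nullable created_by_job_id column on the dag_run table could look like:

```python
# Toy SQLAlchemy sketch, not Airflow's real models: the proposed
# created_by_job_id column on the dag_run table.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DagRun(Base):
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)
    run_type = Column(String(50), nullable=False)
    # Nullable on purpose: runs triggered via CLI or webserver
    # have no job to reference
    created_by_job_id = Column(Integer, nullable=True)
```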
Do you have a case where it's interesting to know what job created the DagRun? (I ask because I can't think of one immediately)
I have an operator that triggers a backfill job and observes that process. If the "parent" task dies (SIGKILL), then there are zombie tasks (scheduled/none state) from the backfill job that are not cleaned up by anything. But that's probably an edge case + we use a custom implementation of BackfillJob (that's why I was able to fix it by using DagRun.conf for storing the job_id - a hack).
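Roughly, the workaround looks like this (a sketch only; the conf key name is made up for this example, and our real code differs):

```python
from airflow.utils.state import State

def start_backfill_run(dag, job, execution_date):
    """Create the backfill DagRun and record which job created it.

    Workaround sketch: without a real column, stash the job id in conf.
    The "creating_job_id" key is hypothetical, invented for this example.
    """
    return dag.create_dagrun(
        run_id=f"backfill__{execution_date.isoformat()}",
        execution_date=execution_date,
        state=State.RUNNING,
        conf={"creating_job_id": job.id},  # the hack: no real FK, just conf
        external_trigger=False,
    )
```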
However, I think that this information can be helpful in the case of multiple schedulers, for example, to find out that only one of them has some problems (no idea what problems).
So, basically my suggestion is about adding non-crucial information that may help sometimes 🙂
I would probably suggest naming it created_by_job_id - it's a bit clearer what it stores than just job_id
+1 to this
After a discussion with @turbaszek - I think it's super useful to have this information, as it allows us to investigate the reasons for problems, and we can add more information - for example, we will know which scheduler created a DagRun. I would love to get it added even for 2.0, after merging the HA change.
Additionally, this may help us with making backfill runnable remotely.
Run:
```bash
airflow dags backfill -v -s 2020-10-06 example_bash_operator
# once there's a process running a single task, do the following:
pkill -9 -f backfil
```
This will result in a "zombie" DagRun and a related task instance that will not be cleaned up by the scheduler (at least that's my understanding). Example:
However, querying the job table, we see:
So, the backfill job is still running according to Airflow's state, but that's not true, as we killed the job 🙂
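For instance, a quick check along these lines (assuming Airflow's BaseJob model and the create_session helper) will still report the killed BackfillJob as running:

```python
# Inspect the job table from Python instead of raw SQL; a BackfillJob killed
# with SIGKILL still shows state "running" with a stale heartbeat.
from airflow.jobs.base_job import BaseJob
from airflow.utils.session import create_session

with create_session() as session:
    for job in session.query(BaseJob).filter(BaseJob.state == "running"):
        print(job.id, job.job_type, job.state, job.latest_heartbeat)
```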
Link a specific job to the DagRun triggered by it (using the job_id) and then run a process that will kill the zombies.
This can be done either by:
Cleaning of such zombies can be easily triggered by the scheduler.
I think this may bring us closer to triggering backfill via API / UI.
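To make the idea concrete, here is a rough sketch of such a cleanup, assuming the proposed created_by_job_id column existed (it doesn't yet) and an arbitrary five-minute heartbeat timeout:

```python
# Hypothetical zombie cleanup: fail BackfillJobs whose heartbeat is stale and
# fail the DagRuns they created. Relies on the proposed (not yet existing)
# DagRun.created_by_job_id column, so treat this as a sketch, not real code.
from datetime import timedelta

from airflow.jobs.base_job import BaseJob
from airflow.models import DagRun
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

HEARTBEAT_TIMEOUT = timedelta(minutes=5)  # arbitrary threshold for the example

def clean_zombie_backfills():
    with create_session() as session:
        limit = timezone.utcnow() - HEARTBEAT_TIMEOUT
        stale_jobs = session.query(BaseJob).filter(
            BaseJob.job_type == "BackfillJob",
            BaseJob.state == State.RUNNING,
            BaseJob.latest_heartbeat < limit,
        )
        for job in stale_jobs:
            job.state = State.FAILED
            # The proposed column would let us find runs owned by the dead job
            for run in session.query(DagRun).filter(
                DagRun.created_by_job_id == job.id  # proposed column
            ):
                run.state = State.FAILED
```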
WDYT? @ashb @kaxil @potiuk @mik-laj @dimberman
Got it, that makes sense.
There's code in the scheduler HA branch that will kill timed-out/not-heartbeating SchedulerJobs - that could be extended to BackfillJob too.
I'm not sure what change would be needed to detect zombie tasks from backfill jobs, but it's a good goal, I agree.
I will try to tackle it once #10956 is merged
+1 to this change, thanks @turbaszek
+1 from my side. This can be used to tune the scheduler better.
@turbaszek Did we do all of this or just the "job id" part?
@turbaszek Did we do all of this or just the "job id" part?
Just the job id. I'm going to propose an AIP in a few days for redesigning backfill.
Description
Currently, we have run_type in the DagRun table, but there's no way to determine what job created a DagRun (there is no 1-1 relation between DagRun and Job). This can be helpful in debugging (I think especially in the case of Scheduler HA and BackfillJobs), as it will also allow users to check which scheduler / backfill job triggered their task (currently the job_id in TaskInstance is always the id of a LocalTaskJob). Introducing job_id may help us with making backfill runnable remotely, as we will be able to clean up after jobs that failed, thus reducing possible zombies.
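For the record, a minimal Alembic migration sketch for the proposed column (the name comes from this discussion; the actual migration, if merged, may differ):

```python
# Hypothetical Alembic migration adding the proposed column; not Airflow's
# real migration, which may use a different name or type.
import sqlalchemy as sa
from alembic import op

def upgrade():
    # Nullable on purpose: manually triggered runs (CLI/webserver) have no job
    op.add_column(
        "dag_run",
        sa.Column("created_by_job_id", sa.Integer(), nullable=True),
    )

def downgrade():
    op.drop_column("dag_run", "created_by_job_id")
```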
The problem
Run:
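```bash
airflow dags backfill -v -s 2020-10-06 example_bash_operator
# once there's a process running a single task, do the following:
pkill -9 -f backfil
```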
This will result in a "zombie" DagRun and a related task instance that will not be cleaned up by the scheduler (at least that's my understanding). Example:
However, querying the job table, we see:
So, the backfill job is still running according to Airflow's state, but that's not true, as we killed the job 🙂
Possible solution
Link a specific job to the DagRun triggered by it (using the job_id) and then run a process that will kill the zombies. This can be done either by:
Cleaning of such zombies can be easily triggered by the scheduler.
I think this may bring us closer to triggering backfill via API / UI.
Use case / motivation
Introduce the relation between DagRun and Job tables and implement a process that will clean up zombies created by BackfillJobs.
Related Issues
https://github.com/apache/airflow/pull/8227
Thanks to our friends from Databand for hinting at this!