Open boushphong opened 11 months ago
@boushphong Thanks for logging this. I have been able to replicate it. I see that you indicated you'd be willing to submit a PR, so shall I assign you to this issue?
Sure thing. Let me work on this.
@nathadfield Just submitted a pull request. It's a DRAFT PR for now. https://github.com/apache/airflow/pull/36032/files#diff-4253adbb36bfb93cb75ab00c7d509518134e5bf1ad16473b64a2a6d8fa456c92L208-L214
I went with the idea of removing the primary key from the dataset_dag_run_queue table, so that an insert into the table such as:
stmt = insert(DatasetDagRunQueue).values(dataset_id=dataset.id).on_conflict_do_nothing()
no longer hits a conflict. Currently, if a DAG has multiple tasks updating the same Dataset, we insert 2 records and they conflict with each other due to the primary key constraint.
Just briefing my idea before committing more time to this solution. WDYT? By the way, if I make changes to the model, do I have to modify the migrations package, and if so, where should I look? Cheers!
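To illustrate the conflict described above, here is a minimal, self-contained sketch that uses SQLite rather than Airflow's metadata database. The table layout (a composite key on dataset_id and target_dag_id), the value "consumer_dag", and the helper name queued_rows are simplified assumptions made for illustration, and INSERT OR IGNORE stands in for on_conflict_do_nothing(). It only shows the mechanism, not a confirmed root cause.

```python
import sqlite3


def queued_rows(with_primary_key: bool) -> int:
    """Insert two queue records for the same dataset/DAG pair; return how many survive."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified stand-in for the dataset_dag_run_queue table.
    pk = ", PRIMARY KEY (dataset_id, target_dag_id)" if with_primary_key else ""
    conn.execute(
        f"CREATE TABLE dataset_dag_run_queue (dataset_id INTEGER, target_dag_id TEXT{pk})"
    )
    for _ in range(2):  # two tasks updating the same Dataset
        # INSERT OR IGNORE is used here as SQLite's counterpart to on_conflict_do_nothing().
        conn.execute(
            "INSERT OR IGNORE INTO dataset_dag_run_queue VALUES (?, ?)",
            (1, "consumer_dag"),
        )
    count = conn.execute("SELECT COUNT(*) FROM dataset_dag_run_queue").fetchone()[0]
    conn.close()
    return count


print(queued_rows(with_primary_key=True))   # 1: the second insert is silently dropped
print(queued_rows(with_primary_key=False))  # 2: both records are kept
```

With the key in place, the second insert is discarded without an error, which would match seeing only one queued record for two dataset updates. Whether dropping the key is the right fix for Airflow is a separate question.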
@boushphong I'm probably not the best person to comment on this, as I don't really know much about this aspect of Airflow. Perhaps tag some of the people who have also worked in this area on the PR?
Apache Airflow version
2.7.3
What happened
When multiple Airflow tasks finish at about the same time, and those tasks are responsible for triggering another DAG via a Dataset, some dataset-triggered DAG runs are missing.
For example: if a DAG has 2 tasks that trigger another DAG via a Dataset, there should be 2 dataset-triggered DAG runs for the triggered DAG. From my observation, if the 2 tasks finish at about the same time, triggered DAG runs go missing, so only 1 DAG run might be triggered instead of 2.
What you think should happen instead
The number of dataset-triggered DAG runs should equal the number of tasks that trigger the dataset run, even when those tasks finish at the same time.
How to reproduce
Code to reproduce:
The dataset_triggered_runs DAG has 2 tasks (that trigger a dataset run) finishing at different times, and there are 2 dataset-triggered DAG runs, which is expected.
However, the missing_dataset_triggered_runs DAG has 2 tasks (that trigger a dataset run) finishing at about the same time, and there is only 1 dataset-triggered DAG run, which is unexpected. This is very likely a bug.
Operating System
Docker
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
No response
Anything else
No response
Are you willing to submit PR?