Closed NickYadance closed 10 months ago
What is your Airflow version?
2.3.2
I'll close the issue because I didn't find an easy way to reproduce it. The lock contention between the two queries is quite hard to construct. The workaround is to reduce how often the scheduler checks for timed-out triggers, to avoid potential lock contention.
https://github.com/apache/airflow/blob/0d78ba560dec2e7ea2670744800864906622a4a4/airflow/jobs/scheduler_job.py#L1461-L1484
There is a configuration option, trigger_timeout_check_interval, which defaults to 15. I raised it to a reasonably higher value and the deadlock issue was greatly reduced.
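For anyone wanting to try the same workaround, here is a minimal sketch of where the option would go. It assumes the option lives in the [scheduler] section of airflow.cfg and that Airflow's usual AIRFLOW__{SECTION}__{KEY} environment variable naming applies; the value 60 is only an illustration, not a tested recommendation.

```python
import configparser

# Hypothetical airflow.cfg fragment. The value 60 is illustrative only;
# the comment above just says "a reasonably higher value" than the default 15.
AIRFLOW_CFG_FRAGMENT = """
[scheduler]
trigger_timeout_check_interval = 60
"""

parser = configparser.ConfigParser()
parser.read_string(AIRFLOW_CFG_FRAGMENT)
interval = parser.getfloat("scheduler", "trigger_timeout_check_interval")

# Assumed equivalent environment variable, following AIRFLOW__{SECTION}__{KEY}.
env_name = "AIRFLOW__SCHEDULER__TRIGGER_TIMEOUT_CHECK_INTERVAL"
print(f"{env_name}={interval:g}")
```

A longer interval only makes the scheduler's timeout query run less often, so the two queries collide less frequently. It does not remove the underlying lock-order conflict.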
I will re-open this one. It has enough information to try to avoid the deadlock in the first place. The problem is that the Triggerer acquires the same locks as the scheduler but in a different sequence. The right solution should be to change either the Triggerer (most likely) or the scheduler (rather unlikely) so that both acquire the locks in the same sequence.
Most likely the Triggerer should attempt to lock DagRun first and only then update the task instance, or even avoid locking DagRun in the first place. I believe we fixed a very similar deadlock situation recently.
I will take a look at this shortly (or maybe @ashb or @andrewgodwin might take a look at it before).
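To illustrate the lock-ordering idea in isolation, here is a small sketch using plain Python threads instead of database rows. The two locks are hypothetical stand-ins for the DagRun and TaskInstance row locks, and the worker names are only labels; this is not Airflow code. Because both workers take the locks in the same order, neither can end up holding one lock while waiting for the other.

```python
import threading

# Hypothetical stand-ins for the two row locks discussed above.
dag_run_lock = threading.Lock()
task_instance_lock = threading.Lock()


def update_in_fixed_order(name, results, iterations=1000):
    """Always take dag_run_lock first, then task_instance_lock."""
    for _ in range(iterations):
        with dag_run_lock:
            with task_instance_lock:
                results[name] = results.get(name, 0) + 1


results = {}
threads = [
    threading.Thread(target=update_in_fixed_order, args=(name, results))
    for name in ("scheduler", "triggerer")
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)

# With a consistent order, both workers finish instead of blocking each other.
completed = all(not t.is_alive() for t in threads)
print(completed, results)
```

If one worker instead took task_instance_lock first, each could hold one lock while waiting for the other, which is the shape of the deadlock reported here.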
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Please recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version
Other Airflow 2 version (2.3.2)
What happened
There is a discussion, #22553, about this, but without a detailed trace. There is also a similar issue, #23639. The triggerer will occasionally die due to a DB transaction deadlock. In my case the triggerer dies 5-6 times per day.
MySQL engine status
Trigger exit log
This query holds a row lock on the primary index (dag_id, task_id, run_id, map_index) and waits for a secondary index lock.
https://github.com/apache/airflow/blob/0d78ba560dec2e7ea2670744800864906622a4a4/airflow/models/trigger.py#L118-L135
This query holds a row lock on the secondary index (state), as the engine status shows, and waits for the primary index lock, causing the deadlock.
https://github.com/apache/airflow/blob/0d78ba560dec2e7ea2670744800864906622a4a4/airflow/jobs/scheduler_job.py#L1461-L1484
#22553 and #23639 offer different solutions to this.
Add with_row_lock to the queries so the selected rows are pre-locked, without lock contention.
As for retry, there is already a retry in the preceding methods.
https://github.com/apache/airflow/blob/0d78ba560dec2e7ea2670744800864906622a4a4/airflow/models/trigger.py#L94-L116
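For reference, the retry approach mentioned above has roughly the following shape. This is a generic sketch, not the code in trigger.py: DeadlockError, run_with_retry, and flaky_update are hypothetical names, and the simulated failure stands in for the database reporting a deadlock.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the DB driver's deadlock error."""


def run_with_retry(txn, attempts=3, base_delay=0.01):
    """Run txn(), retrying with a growing delay when a deadlock is reported."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            time.sleep(base_delay * attempt)


calls = {"n": 0}


def flaky_update():
    # Simulated transaction that is chosen as the deadlock victim twice.
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "committed"


outcome = run_with_retry(flaky_update)
print(outcome, calls["n"])
```

A retry like this only recovers after the database has already picked a deadlock victim. It does not stop the two queries from taking their locks in opposite order.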
What you think should happen instead
No response
How to reproduce
No response
Operating System
ubuntu
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct