Closed ldacey closed 2 years ago
Comment from @ldacey on Slack - this seems like a 2.3.3-only issue; 2.3.2 works fine
CC: @uranusjr @ashb - you might want to take a look, seems like a regression in 2.3.3
Just commenting that I'm also seeing this issue on 2.3.3 under similar circumstances
We are also facing this issue when we use dynamic task mapping in 2.3.3. The scheduler crashes after a while when we enable the DAG that uses dynamic task mapping, so we need to downgrade to 2.3.2. But 2.3.2 has an issue related to task groups: the UI shows an incorrect status in the grid view (https://github.com/apache/airflow/issues/24998). This affects our daily operations, as we cannot see the status directly from the grid view, only when we expand those task groups. I wonder if this issue will be fixed in 2.3.4, and what is the target release date of 2.3.4?
@ldacey, can you show the full log? It's not clear to me where this started to fail.
I'm also facing this issue. I'm also using 2.3.3 with dynamic task mapping, but using SQL Server for the metadata database. The error in the log is basically the same ("Inserting duplicate key on 'dbo.task_instance' is not possible because it violates constraint 'task_instance_pkey'"). After the error occurs, like for other people here, tasks and DAGs get stuck in the "running" state until the Airflow banner notifies me that there is no scheduler heartbeat.
More info on my Airflow environment
sqlalchemy.exc.IntegrityError: (pyodbc.IntegrityError) ('23000', "[23000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Violation of PRIMARY KEY constraint 'task_instance_pkey'. Cannot insert duplicate key in object 'dbo.task_instance'. The duplicate key value is (az_partner_etl_usage, azplan_unbilled_lineitems, scheduled__2022-08-02T00:00:00+00:00, 0). (2627)
Error code 2627 (see the list of SQL Server error codes): Violation of PRIMARY KEY constraint 'task_instance_pkey'. Cannot insert duplicate key in object 'dbo.task_instance'. The duplicate key value is (az_partner_etl_usage, azplan_unbilled_lineitems, scheduled__2022-08-02T00:00:00+00:00, 0)
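The failing constraint is the task_instance primary key, which since Airflow 2.3 is the four-column tuple (dag_id, task_id, run_id, map_index); map_index was added for dynamic task mapping, and the duplicate key value in the log above matches that shape. A minimal stand-alone sketch of this class of failure, using a heavily simplified stand-in schema (not Airflow's real table definition) and stdlib sqlite3:

```python
import sqlite3

# Simplified stand-in for Airflow's task_instance table. The real schema has
# many more columns, but the primary key since 2.3 is the four-column tuple
# (dag_id, task_id, run_id, map_index), matching the duplicate key value
# (az_partner_etl_usage, azplan_unbilled_lineitems, scheduled__..., 0) above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_instance (
        dag_id    TEXT NOT NULL,
        task_id   TEXT NOT NULL,
        run_id    TEXT NOT NULL,
        map_index INTEGER NOT NULL,
        PRIMARY KEY (dag_id, task_id, run_id, map_index)
    )
    """
)

row = (
    "az_partner_etl_usage",
    "azplan_unbilled_lineitems",
    "scheduled__2022-08-02T00:00:00+00:00",
    0,
)
conn.execute("INSERT INTO task_instance VALUES (?, ?, ?, ?)", row)

# Inserting the same (dag_id, task_id, run_id, map_index) tuple again is
# rejected by the database, which is the same class of failure the scheduler
# hits when it tries to re-create an already-existing mapped task instance.
try:
    conn.execute("INSERT INTO task_instance VALUES (?, ?, ?, ?)", row)
except sqlite3.IntegrityError as exc:
    print(f"IntegrityError: {exc}")
```

The database backend only changes the surface of the error (pyodbc/SQL Server error 2627 here, psycopg2 UniqueViolation on Postgres); the underlying duplicate-insert is the same.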
It seems like the merged PR (#25532) only solved the scheduler crashes, which are part of this issue, but not the real underlying problem. I feel we should still keep it open. WDYT? cc: @potiuk
I think this issue is about the "crash", which was the "strong" manifestation of the problem and did not let us see all the important details of the actual issue. Rather than re-opening this one, I'd open another one solely to handle the duplicate task instance problem, if we can get hold of more information and logs on it.
Mostly because now that we have solved the crash, we might get more data/information about the underlying problem (the details are a bit clouded now, and it's even likely there are a few different reasons that triggered this crash).
We have another issue opened for the data integrity problem (#25200), we can track things there.
Hello all, we have observed that the Airflow scheduler heartbeat has stopped and all the DAGs were stuck in a queued or running state. We checked the scheduler logs and got the error below. May I know what might be the cause of this?
Details: We are running Airflow in the Google Cloud Composer service with the following versions: composer-2.1.4, airflow-2.4.3
Error: sqlalchemy.exc.PendingRollbackError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "task_instance_pkey"
Detailed logs and circumstances would be needed (see https://github.com/apache/airflow/issues/25200 for an example of how detailed information helped in fixing a similar issue).
@potiuk
Sure, I'm providing the details. The same issue has occurred even in one of our testing environments.
Apache Airflow version: 2.4.3, deployed in Google Cloud Composer: 2.1.4
What happened:
We have one scheduled DAG which triggers at 12:00 AM. In that DAG we have a main task and sub-tasks as well. The sub-tasks are created dynamically based on a few arguments.
When a sub-task (which was created dynamically) starts in the DAG, I can see the instance details as null (meaning no instance has been created for that task; please refer to screenshot 1). So I don't get any logs for that task.
Screenshot 1:
But when I checked the logs in the Composer service, I can see the error log which occurred under the scheduler, at a time almost matching when the scheduler heartbeat stopped. (Please refer to screenshot 2.)
Screenshot 2:
Need clarification regarding this issue.
Please let me know if any other details are required.
@uranusjr - wasn't the "null" mapped task instance issue fixed since 2.4.3? I cannot find it easily.
Apache Airflow version
2.3.3 (latest released)
What happened
Our schedulers have crashed on two occasions after upgrading to Airflow 2.3.3. The same DAG is responsible each time, but this is likely because it is the only DAG using dynamic task mapping right now (it is catching up on some historical data). This DAG uses the same imported @task function that many other DAGs have used successfully with no errors. The issue has only occurred after upgrading to Airflow 2.3.3.
What you think should happen instead
This error should not be raised - there should be no record of this task instance because, according to the UI, the task has not run yet. The extract task is green but the transform task which raised the error is blank. The DAG run is stuck in the running state until eventually the scheduler dies and the Airflow banner notifies me that there is no scheduler heartbeat.
Also, this same DAG (and others which use the same imported external @task function) ran for hours before the upgrade to Airflow 2.3.3.
How to reproduce
Run a dynamic task mapping DAG in Airflow 2.3.3
Operating System
Ubuntu 20.04
Versions of Apache Airflow Providers
The 2.3.3 constraints file for Python 3.10 is used for the specific versions:
Deployment
Other Docker-based deployment
Deployment details
I am using two schedulers which run on separate nodes.
Anything else
The DAG only allows 1 max active DAG run at a time. catchup=True is enabled, and it has been running to fill in all tasks since the 05/10 start_date.
The extract() task returns a list of 1 or more files which have been saved on cloud storage. The transform task processes each of these paths dynamically. I have used these same tasks (imported from another file) for over 15 different DAGs so far without issue. The problem only occurred yesterday, sometime after updating Airflow to 2.3.3.
My transform_files task is just a function which expands the XComArg of the extract task and transforms each file. Nearly everything is based on DAG params which are customized in the DAG.
Deleting the DAG run which caused the error and restarting the Airflow scheduler fixes the issue temporarily. If I do not delete the DAG run, the scheduler will keep dying.
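The pattern described can be sketched as a minimal TaskFlow DAG. This is not the reporter's actual code: the task bodies, file paths, and schedule are hypothetical placeholders, assuming Airflow 2.3.x with the TaskFlow API; only the structure (extract returning a list, transform expanded over it, catchup=True, one active run) follows the description above.

```python
# Hypothetical reduction of the DAG described above, assuming Airflow 2.3.x.
# Task bodies and paths are illustrative placeholders, not the real code.
from datetime import datetime

from airflow.decorators import dag, task


@task
def extract():
    # Stand-in for the real extract task, which returns a list of one or
    # more file paths saved on cloud storage.
    return ["path/to/file_1.parquet", "path/to/file_2.parquet"]


@task
def transform_files(path: str):
    # Stand-in for the real transform task, which processes a single file.
    print(f"transforming {path}")


@dag(
    schedule_interval="@daily",
    start_date=datetime(2022, 5, 10),
    catchup=True,       # backfilling since the 05/10 start_date
    max_active_runs=1,  # only 1 active DAG run at a time
)
def dynamic_mapping_repro():
    # One mapped transform_files task instance is created per extracted
    # path; each gets its own map_index in the task_instance table.
    transform_files.expand(path=extract())


dynamic_mapping_repro()
```

The crash reports in this thread suggest the scheduler attempting to recreate one of these mapped task instances with a (dag_id, task_id, run_id, map_index) tuple that already exists.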
Are you willing to submit PR?
Code of Conduct