Open trlopes1974 opened 1 month ago
That 10-minute gap could be caused by this config option: https://github.com/apache/airflow/blob/6229c1c98a2a78d00255d6b6c5b70d032cff9b80/airflow/config_templates/config.yml#L2600-L2606
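For reference, the option at those lines is `task_queued_timeout` in the `[scheduler]` section, whose default of 600.0 seconds lines up with the observed 10-minute gap. A hedged sketch of how it could be raised in `airflow.cfg` (the value here is illustrative, not a recommendation):

```ini
[scheduler]
# Seconds a task may stay in the "queued" state before the scheduler
# marks it failed (or retries it). The default is 600.0 (10 minutes).
task_queued_timeout = 1800.0
```

The effective value on a running deployment can be checked with `airflow config get-value scheduler task_queued_timeout`.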
Maybe; that makes some sense, as we do not have that setting in our configuration. But what is bothering me is the WHY: why did it time out after being queued? We have no resource exhaustion anywhere, not in CPU, memory, pool slots, or concurrency. I'd say that at the current time we have a very lightly loaded system...
So, I believe that the real question is, why is the task queued but never started?
About ±1 year, with some cleanups.
All timeouts in our config:

```
dagbag_import_timeout = 120.0
dag_file_processor_timeout = 180
default_task_execution_timeout =
web_server_master_timeout = 120
web_server_worker_timeout = 120
smtp_timeout = 30
operation_timeout = 2
stalled_task_timeout = 0
default_timeout = 604800
```
It's the same case for me when the incident happened the other day: workers are all online, but no task gets executed. I noticed something abnormal in the scheduler log when this happens: all tasks in a DAG were repeatedly queued thousands of times within one second. It looks like the scheduler gets into a strange state.
Something is messed up. Today I had a filesystem sensor that failed with a timeout (the Airflow task), but the Celery task was successful...
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.9.2
What happened?
Case 2 of issue #39717
Task fails with the message: `Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?`
Although the Airflow task has an `external_executor_id`, it is not registered in Flower (Celery).
| Dagrun Running | Task instance's dagrun was not in the 'running' state but in the state 'failed'. |
| -- | -- |
| Task Instance State | Task is in the 'failed' state. |

The `external_executor_id` `a01b358f-ad34-4b16-81b9-fd69218f613e` does not exist in Flower / Celery.
Look at the timestamps in the logs: there is a gap of 10 minutes between the Start (dummy task) and the dispatch_restores task, and this behaviour is recurrent (the 10-minute gap).
And in the task log `attempt=1.log.SchedulerJob.log`:

```
(tkapp) ttauto@slautop02$ cat attempt=1.log.SchedulerJob.log
[2024-08-05T21:02:15.585+0100] {event_scheduler.py:40} WARNING - Marking task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.
[2024-08-05T21:02:16.750+0100] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
```
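The WARNING above comes from the scheduler's stuck-in-queued check. A minimal sketch of that logic, assuming the default `task_queued_timeout` of 600 seconds (the function name and timestamps here are illustrative, loosely based on the log above, not the actual Airflow implementation):

```python
from datetime import datetime, timedelta, timezone

# Default value of [scheduler] task_queued_timeout (assumption: defaults in use)
TASK_QUEUED_TIMEOUT = timedelta(seconds=600)

def is_stuck_in_queued(queued_at: datetime, now: datetime) -> bool:
    """Return True if a task instance has sat in 'queued' longer than the timeout."""
    return now - queued_at > TASK_QUEUED_TIMEOUT

# Illustrative timestamps: queued ~10m05s before the WARNING was logged
queued_at = datetime(2024, 8, 5, 19, 52, 10, tzinfo=timezone.utc)
now = datetime(2024, 8, 5, 20, 2, 15, tzinfo=timezone.utc)
print(is_stuck_in_queued(queued_at, now))  # True: queued more than 600 s ago
```

This would explain the mechanics of the failure message, but not the root cause, i.e. why the Celery worker never picked the task up in the first place.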
What you think should happen instead?
The external task should exist in Celery, and the reason for the task being marked as failed should be explicit. There is no other info that justifies the task failing.
How to reproduce
It just happens randomly.
Operating System
```
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
```
Versions of Apache Airflow Providers
```
apache-airflow==2.9.2
apache-airflow-providers-celery==3.7.2
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-fab==1.1.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-hashicorp==3.7.1
apache-airflow-providers-http==4.11.1
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-postgres==5.11.1
apache-airflow-providers-sftp==4.10.1
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
apache-airflow-providers-ssh==3.11.1
```
Deployment
Other
Deployment details
Just a simple Airflow install on a Linux box, using PostgreSQL, RabbitMQ, and the Celery executor.
Anything else?
It happens randomly.
Are you willing to submit PR?
Code of Conduct