
Executor reports task instance <TaskInstance: XXX [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally #41276

Open trlopes1974 opened 1 month ago

trlopes1974 commented 1 month ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.9.2

What happened?

Case 2 of issue #39717

Task fails with the message: Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally.

Although the Airflow task has an external_executor_id, it is not registered in Flower (Celery).

| Dagrun Running | Task instance's dagrun was not in the 'running' state but in the state 'failed'. |
| -- | -- |
| Task Instance State | Task is in the 'failed' state. |

The external_executor_id a01b358f-ad34-4b16-81b9-fd69218f613e does not exist in Flower / Celery.
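One way to cross-check whether Celery ever saw the id that Airflow recorded is to query Flower's HTTP API. This is only a sketch, under the assumption that Flower's API is enabled and reachable; the URL is a placeholder and the id is the one from the report above:

```python
# Sketch: ask Flower whether it knows the external_executor_id recorded by Airflow.
# Assumes Flower's HTTP API is enabled; FLOWER_URL is a placeholder for your deployment.
import requests

FLOWER_URL = "http://localhost:5555"
external_executor_id = "a01b358f-ad34-4b16-81b9-fd69218f613e"  # id from the failed task instance

resp = requests.get(f"{FLOWER_URL}/api/task/info/{external_executor_id}", timeout=10)
if resp.status_code == 404:
    print("Flower has no record of this Celery task id")
elif resp.ok:
    print("Celery task state:", resp.json().get("state"))
```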

Look at the timestamps in the logs:

[screenshot]

[screenshot]

There is a gap of 10 minutes between the Start (dummy) task and the dispatch_restores task, and this behaviour (the 10-minute gap) is recurrent.

And in the task log 'attempt=1.log.SchedulerJob.log':

(tkapp) ttauto@slautop02$ cat attempt=1.log.SchedulerJob.log
[2024-08-05T21:02:15.585+0100] {event_scheduler.py:40} WARNING - Marking task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.
[2024-08-05T21:02:16.750+0100] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?

What you think should happen instead?

The external task should exist in Celery, and the reason for the task being marked as failed should be explicit. There is no other information that justifies the task failing.

How to reproduce

It just happens.

Operating System

NAME="Red Hat Enterprise Linux" VERSION="8.9 (Ootpa)" ID="rhel" ID_LIKE="fedora" VERSION_ID="8.9" PLATFORM_ID="platform:el8" PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos" HOME_URL="https://www.redhat.com/" DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8" REDHAT_BUGZILLA_PRODUCT_VERSION=8.9 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

Versions of Apache Airflow Providers

apache-airflow==2.9.2
apache-airflow-providers-celery==3.7.2
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-fab==1.1.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-hashicorp==3.7.1
apache-airflow-providers-http==4.11.1
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-postgres==5.11.1
apache-airflow-providers-sftp==4.10.1
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
apache-airflow-providers-ssh==3.11.1

Deployment

Other

Deployment details

Just a simple Airflow install on a Linux box, using PostgreSQL, RabbitMQ, and the Celery executor.

Anything else?

It happens randomly.

Are you willing to submit PR?

Code of Conduct

scaoupgrade commented 1 month ago

That 10-minute gap could be because of this config: https://github.com/apache/airflow/blob/6229c1c98a2a78d00255d6b6c5b70d032cff9b80/airflow/config_templates/config.yml#L2600-L2606
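If the setting in the linked region is the scheduler's task_queued_timeout (an assumption here; its documented default of 600 seconds would line up with the observed 10-minute gap), the value in effect on a given deployment could be inspected with a small sketch like this:

```python
# Sketch: print the effective value of the setting that fails tasks stuck in "queued".
# Assumption: the linked config region is [scheduler] task_queued_timeout, whose
# default (600 seconds) would match the observed 10-minute gap.
from airflow.configuration import conf

print(conf.getfloat("scheduler", "task_queued_timeout", fallback=600.0))
```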

trlopes1974 commented 1 month ago

Maybe. It makes some sense, as we do not have that setting in the configuration. But what bothers me is WHY: why did it time out after being queued? We have no resource exhaustion anywhere, not in CPU, memory, pool slots, or concurrency. I'd say that at the moment we have a very lightly loaded system...

So, I believe that the real question is, why is the task queued but never started?

±1 year of history, with some cleanups: [screenshot]

All timeouts in our config:

dagbag_import_timeout = 120.0
dag_file_processor_timeout = 180
default_task_execution_timeout =
web_server_master_timeout = 120
web_server_worker_timeout = 120
smtp_timeout = 30
operation_timeout = 2
stalled_task_timeout = 0
default_timeout = 604800

scaoupgrade commented 1 month ago

> So, I believe that the real question is, why is the task queued but never started?

It's the same case for me when the incident happened the other day. Workers were all online, but no task got executed. I noticed something abnormal in the scheduler log when this happened: all tasks in a DAG were repeatedly queued thousands of times within one second. It looks like the scheduler gets into a strange state.

trlopes1974 commented 1 month ago

Something is messed up. Today I had a filesystem sensor that failed by timeout (the Airflow task), but the Celery task was successful...


vaibhavnsingh commented 1 day ago