apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

BashOperator is killed before it can execute #41597

Closed. crutis closed this issue 1 week ago.

crutis commented 3 weeks ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.4

What happened?

Attempting to run a simple hello world BashOperator example:

from datetime import datetime
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

DEFAULT_ARGS = {
    'retries': 0,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'hello_world_dag',
    default_args=DEFAULT_ARGS,
    description='BashOperator test',
    schedule_interval=None,
    max_active_runs=1,
    concurrency=1,
    start_date=datetime(2024, 8, 1),
    tags=['cwilde'],
    catchup=False
)

# Define the BashOperator task
hello_world_task = BashOperator(
    task_id='hello_world_task',
    bash_command='echo "hello world"',
    dag=dag
)

# Define the task dependencies
hello_world_task

The task never executes; it is killed immediately, before it can run. This is not a timeout issue: it happens immediately after manually triggering the DAG.

*** Found local files:
***   * /opt/airflow_logs/joblogs/dag_id=hello_world_dag/run_id=manual__2024-08-19T16:43:03.065644+00:00/task_id=hello_world_task/attempt=1.log.SchedulerJob.log
[2024-08-19, 10:43:16 MDT] {scheduler_job_runner.py:780} ERROR - Executor reports task instance <TaskInstance: hello_world_dag.hello_world_task manual__2024-08-19T16:43:03.065644+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?

What do you think should happen instead?

No response

How to reproduce

Trigger the DAG manually. The task should execute the bash command echo "hello world", but it is killed before the command can run.
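
For reference, the same manual trigger can also be issued from the CLI, and airflow tasks test runs the task in-process without the executor, which helps show whether the task code itself is the problem. The commands below are a minimal sketch (the date is arbitrary):

# Trigger the DAG the same way the UI does (creates a manual__ run)
airflow dags unpause hello_world_dag
airflow dags trigger hello_world_dag

# Run the task in-process, bypassing the scheduler/executor; if this succeeds,
# the problem is likely in the executor/worker layer rather than the task code
airflow tasks test hello_world_dag hello_world_task 2024-08-01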

Operating System

NAME="Amazon Linux" VERSION="2" ID="amzn" ID_LIKE="centos rhel fedora" VERSION_ID="2" PRETTY_NAME="Amazon Linux 2" ANSI_COLOR="0;33" CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2" HOME_URL="https://amazonlinux.com/" SUPPORT_END="2025-06-30"

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.19.0
apache-airflow-providers-celery==3.6.1
apache-airflow-providers-common-io==1.3.0
apache-airflow-providers-common-sql==1.11.1
apache-airflow-providers-ftp==3.7.0
apache-airflow-providers-http==4.10.0
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-mysql==5.5.4
apache-airflow-providers-redis==3.6.0
apache-airflow-providers-slack==8.6.1
apache-airflow-providers-smtp==1.6.1
apache-airflow-providers-sqlite==3.7.1

Deployment

Other

Deployment details

Custom deploy on EC2

Anything else?

This fails every time I execute the DAG

Are you willing to submit a PR?

Code of Conduct

boring-cyborg[bot] commented 3 weeks ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

romsharon98 commented 3 weeks ago

This can happen for a few reasons. Did you try checking the scheduler logs and worker logs?

crutis commented 3 weeks ago

The entire worker log (same as above) is just this:

*** Found local files:
***   * /opt/airflow_logs/joblogs/dag_id=hello_world_dag/run_id=manual__2024-08-19T16:43:03.065644+00:00/task_id=hello_world_task/attempt=1.log.SchedulerJob.log
[2024-08-19, 10:43:16 MDT] {scheduler_job_runner.py:780} ERROR - Executor reports task instance <TaskInstance: hello_world_dag.hello_world_task manual__2024-08-19T16:43:03.065644+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?

The scheduler log just has:

[2024-08-19T16:47:40.525+0000] {processor.py:183} INFO - Processing /opt/airflow/dags/hello_world.py took 0.228 seconds
[2024-08-19T16:48:46.990+0000] {processor.py:161} INFO - Started process (PID=15185) to work on /opt/airflow/dags/hello_world.py
[2024-08-19T16:48:46.991+0000] {processor.py:830} INFO - Processing file /opt/airflow/dags/hello_world.py for tasks to queue
[2024-08-19T16:48:46.991+0000] {logging_mixin.py:188} INFO - [2024-08-19T16:48:46.991+0000] {dagbag.py:540} INFO - Filling up the DagBag from /opt/airflow/dags/hello_world.py
[2024-08-19T16:48:47.013+0000] {processor.py:840} INFO - DAG(s) 'hello_world_dag' retrieved from /opt/airflow/dags/hello_world.py
[2024-08-19T16:48:47.110+0000] {logging_mixin.py:188} INFO - [2024-08-19T16:48:47.110+0000] {dag.py:3047} INFO - Sync 1 DAGs
[2024-08-19T16:48:47.170+0000] {logging_mixin.py:188} INFO - [2024-08-19T16:48:47.170+0000] {dag.py:3834} INFO - Setting next_dagrun for hello_world_dag to None, run_after=None
[2024-08-19T16:48:47.237+0000] {processor.py:183} INFO - Processing /opt/airflow/dags/hello_world.py took 0.251 seconds
[2024-08-19T16:49:54.769+0000] {processor.py:161} INFO - Started process (PID=16116) to work on /opt/airflow/dags/hello_world.py
[2024-08-19T16:49:54.770+0000] {processor.py:830} INFO - Processing file /opt/airflow/dags/hello_world.py for tasks to queue
[2024-08-19T16:49:54.771+0000] {logging_mixin.py:188} INFO - [2024-08-19T16:49:54.771+0000] {dagbag.py:540} INFO - Filling up the DagBag from /opt/airflow/dags/hello_world.py
[2024-08-19T16:49:54.781+0000] {processor.py:840} INFO - DAG(s) 'hello_world_dag' retrieved from /opt/airflow/dags/hello_world.py
[2024-08-19T16:49:54.911+0000] {logging_mixin.py:188} INFO - [2024-08-19T16:49:54.911+0000] {dag.py:3047} INFO - Sync 1 DAGs
[2024-08-19T16:49:54.928+0000] {logging_mixin.py:188} INFO - [2024-08-19T16:49:54.928+0000] {dag.py:3834} INFO - Setting next_dagrun for hello_world_dag to None, run_after=None
[2024-08-19T16:49:55.010+0000] {processor.py:183} INFO - Processing /opt/airflow/dags/hello_world.py took 0.245 seconds

Or is there some other log location I should be looking in?

nathadfield commented 3 weeks ago

@crutis There is nothing wrong with Airflow or your DAG code because it runs without issue when I test it.

***   * /root/airflow/logs/dag_id=hello_world_dag/run_id=manual__2024-08-21T09:52:11.804629+00:00/task_id=hello_world_task/attempt=1.log
[2024-08-21, 09:52:12 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2024-08-21, 09:52:12 UTC] {subprocess.py:63} INFO - Tmp dir root location: /tmp
[2024-08-21, 09:52:12 UTC] {subprocess.py:75} INFO - Running command: ['/usr/bin/bash', '-c', 'echo "hello world"']
[2024-08-21, 09:52:12 UTC] {subprocess.py:86} INFO - Output:
[2024-08-21, 09:52:12 UTC] {subprocess.py:93} INFO - hello world
[2024-08-21, 09:52:12 UTC] {subprocess.py:97} INFO - Command exited with return code 0
[2024-08-21, 09:52:12 UTC] {taskinstance.py:339} ▼ Post task execution logs
[2024-08-21, 09:52:12 UTC] {taskinstance.py:351} INFO - Marking task as SUCCESS. dag_id=hello_world_dag, task_id=hello_world_task, run_id=manual__2024-08-21T09:52:11.804629+00:00, execution_date=20240821T095211, start_date=20240821T095212, end_date=20240821T095212
[2024-08-21, 09:52:12 UTC] {local_task_job_runner.py:261} INFO - Task exited with return code 0
[2024-08-21, 09:52:12 UTC] {taskinstance.py:3888} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2024-08-21, 09:52:12 UTC] {local_task_job_runner.py:240} ▲▲▲ Log group end

So, the issue you are facing is likely due to the way in which you have Airflow installed and configured. The specific error - [2024-08-19, 10:43:16 MDT] {scheduler_job_runner.py:780} ERROR - Executor reports task instance <TaskInstance: hello_world_dag.hello_world_task manual__2024-08-19T16:43:03.065644+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally? - indicates that something external to Airflow killed the process running the task.

I think you need to review your EC2 deployment to check that it is sufficiently resourced and configured for running all of the various Airflow components.
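
As a rough sketch of what to look for (the systemd unit names and the time window below are assumptions, not taken from your report), checks like these on the EC2 host can show whether something outside Airflow, such as the kernel OOM killer or a service restart, killed the process:

# Did the OOM killer terminate anything around the failure time?
dmesg -T | grep -i -E 'oom|killed process'

# If the Airflow components run under systemd (unit names are assumptions), look for restarts
journalctl -u airflow-worker --since "2024-08-19 16:40" --until "2024-08-19 16:50"
journalctl -u airflow-scheduler --since "2024-08-19 16:40" --until "2024-08-19 16:50"

# Basic resource headroom on the host
free -m
df -h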

crutis commented 3 weeks ago

We agree, it should work 😄 but it gets killed immediately, before hitting any configured timeout. I'm attaching my config in case you see anything off.

airflow_config.txt

romsharon98 commented 3 weeks ago


When I asked for the worker logs, I wasn't referring to the logs you posted above. I meant what @nathadfield suggested: going to your deployment and checking the logs directly in the pod. It seems like the pod might not have even started, and reviewing the events/logs there should give you the answer.

crutis commented 3 weeks ago

I guess this is where I'm confused: we're running on EC2 with Celery, not Kubernetes, so we don't have any pods to get logs from.
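
For a Celery setup on EC2, the closest equivalent to pod logs is the output of the airflow celery worker process itself on each worker host. A minimal sketch, where the systemd unit name and log path are placeholders that depend on how the worker was launched:

# If the worker runs under systemd (unit name is a placeholder)
journalctl -u airflow-worker -f

# Or tail whatever file the celery worker process writes to (path is a placeholder)
tail -f /opt/airflow_logs/worker.log

# Flower, if enabled, also shows per-task Celery state in a web UI
airflow celery flower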