apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Tasks intermittently gets terminated with SIGTERM on kubernetes executor #18041

Closed Nimesh-K-Makwana closed 2 years ago

Nimesh-K-Makwana commented 3 years ago

Apache Airflow version

2.1.3 (latest released)

Operating System

Linux

Versions of Apache Airflow Providers

No response

Deployment

Other

Deployment details

Have tried the environment variables suggested in GitHub issue #14672:

AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"
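
For reference, Airflow maps environment variables of the form AIRFLOW__{SECTION}__{KEY} onto airflow.cfg options. A minimal sketch (assuming the overrides above are exported in the worker environment) to confirm they are actually picked up:

# Hedged sketch: check that the env-var overrides above are visible
# to the running Airflow configuration.
from airflow.configuration import conf

print(conf.getint("core", "killed_task_cleanup_time"))                # expected: 604800
print(conf.getboolean("scheduler", "schedule_after_task_execution"))  # expected: False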

What happened

[2021-09-04 10:28:50,536] {local_task_job.py:80} ERROR - Received SIGTERM. Terminating subprocesses
[2021-09-04 10:28:50,536] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 33
[2021-09-04 10:28:50,537] {taskinstance.py:1235} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-09-04 10:28:52,568] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1307, in _execute_task
    result = task_copy.execute(context=context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 150, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 161, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/repo/dags/elastit_schedular/waiting_task_processor.py", line 59, in trigger_task
    time.sleep(1)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1237, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal

What you expected to happen

The DAG should execute successfully without receiving any SIGTERM signal.

How to reproduce

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

sraviteja07 commented 2 years ago

We are also facing the same problem, with the below stack trace:

ERROR - Received SIGTERM. Terminating subprocesses.
ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/airflow/dags/flow_airflow/operator/dts_operator.py", line 100, in execute
    self.to_email_address)
  File "/usr/local/airflow/dags/flow_airflow/operator/dts_operator.py", line 111, in run_process
    for line in iter(process.stdout.readline, ""):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1237, in signal_handler
    raise AirflowException("Task received SIGTERM signal")

WARNING - process psutil.Process(pid=25806, name='python3', status='zombie', started='20:09:10') did not respond to SIGTERM. Trying SIGKILL

{process_utils.py:124} ERROR - Process psutil.Process(pid=54556, name='python3', status='zombie', started='20:09:10') (54556) could not be killed. Giving up.

We also have thousands of tasks, and this happens to some of them intermittently.

GHGHGHKO commented 2 years ago

We got SIGTERM errors on about 250 DAGs.

Solved it via this link.


bparhy commented 2 years ago

@GHGHGHKO thanks for the reply. We are seeing the issue on our task pods, which end up in a failed state in k8s even after the task succeeds, so the result is Error pods in k8s. We control the pod resources using request_memory, limit_memory, and limit_cpu. Are you suggesting increasing each task's limit to 4GB? That would be huge, and our cluster cannot provide that amount of resources since we have a lot of tasks running in parallel. Please let me know.
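
(Not from this thread, just a hedged illustration.) One way to raise resources only for the heavy tasks under the KubernetesExecutor in Airflow 2.x, rather than lifting every task's limit, is a per-task executor_config with a pod_override; the DAG id, task id, callable, and resource values below are placeholders:

from datetime import datetime
from kubernetes.client import models as k8s
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="resource_override_example",       # hypothetical DAG for illustration
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="memory_heavy_task",           # hypothetical task id
        python_callable=lambda: None,          # placeholder callable
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",       # the task container is named "base"
                            resources=k8s.V1ResourceRequirements(
                                requests={"memory": "512Mi", "cpu": "500m"},
                                limits={"memory": "1Gi", "cpu": "1"},
                            ),
                        )
                    ]
                )
            )
        },
    )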

gmyrianthous commented 2 years ago

Possible fix

I was having the same problem after upgrading from Airflow v1.10.15 to v2.2.5 and was seeing the error in long-running DAGs having a fairly high number of tasks.

Apparently, the dagrun_timeout in airflow.models.DAG was not respected in earlier Airflow versions, and I noticed that the DAGs I was trying to migrate to the new Airflow instance were running for much longer than the specified dagrun_timeout.

The solution for me was to increase the dagrun_timeout (e.g. dagrun_timeout=datetime.timedelta(minutes=120)).

Note that this setting is effective only for scheduled runs (in other words, DAGs with a specified schedule_interval).
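
A minimal sketch of that workaround (the DAG id, schedule, and task are placeholders); as noted, dagrun_timeout only kicks in for runs created by the scheduler:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="long_running_example",              # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",                 # timeout applies only to scheduled runs
    dagrun_timeout=timedelta(minutes=120),      # raise this if runs legitimately take longer
    catchup=False,
) as dag:
    PythonOperator(task_id="slow_task", python_callable=lambda: None)  # placeholder task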

allenhaozi commented 2 years ago

I have the same problem. I'm using Airflow 2.2.5 with SparkKubernetesOperator and SparkKubernetesSensor.

The driver is running, but the sensor displays the following logs until the number of retries exceeds the threshold:

[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:104} INFO - Poking: load-customer-data-init-1655486757.7793136
[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:124} INFO - Spark application is still in state: RUNNING
[2022-06-17, 18:06:49 CST] {local_task_job.py:211} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2022-06-17, 18:06:49 CST] {process_utils.py:120} INFO - Sending Signals.SIGTERM to group 84. PIDs of all processes in the group: [84]
[2022-06-17, 18:06:49 CST] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 84
[2022-06-17, 18:06:49 CST] {taskinstance.py:1430} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-06-17, 18:06:49 CST] {taskinstance.py:1774} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/sensors/base.py", line 249, in execute
    time.sleep(self._get_next_poke_interval(started_at, run_duration, try_number))
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1432, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2022-06-17, 18:06:49 CST] {taskinstance.py:1278} INFO - Marking task as FAILED. dag_id=salesforecast-load-init, task_id=load-customer-data-init-sensor, execution_date=20220617T172033, start_date=20220617T175649, end_date=20220617T180649
[2022-06-17, 18:06:49 CST] {standard_task_runner.py:93} ERROR - Failed to execute job 24 for task load-customer-data-init-sensor (Task received SIGTERM signal; 84)
[2022-06-17, 18:06:49 CST] {process_utils.py:70} INFO - Process psutil.Process(pid=84, status='terminated', exitcode=1, started='17:56:48') (84) terminated with exit code 1
potiuk commented 2 years ago

I have the same problem. I'm using Airflow 2.2.5 with SparkKubernetesOperator and SparkKubernetesSensor.

The driver is running, but the sensor displays the following logs until the number of retries exceeds the threshold:

[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:104} INFO - Poking: load-customer-data-init-1655486757.7793136
[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:124} INFO - Spark application is still in state: RUNNING
[2022-06-17, 18:06:49 CST] {local_task_job.py:211} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2022-06-17, 18:06:49 CST] {process_utils.py:120} INFO - Sending Signals.SIGTERM to group 84. PIDs of all processes in the group: [84]
[2022-06-17, 18:06:49 CST] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 84
[2022-06-17, 18:06:49 CST] {taskinstance.py:1430} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-06-17, 18:06:49 CST] {taskinstance.py:1774} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/sensors/base.py", line 249, in execute
    time.sleep(self._get_next_poke_interval(started_at, run_duration, try_number))
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1432, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2022-06-17, 18:06:49 CST] {taskinstance.py:1278} INFO - Marking task as FAILED. dag_id=salesforecast-load-init, task_id=load-customer-data-init-sensor, execution_date=20220617T172033, start_date=20220617T175649, end_date=20220617T180649
[2022-06-17, 18:06:49 CST] {standard_task_runner.py:93} ERROR - Failed to execute job 24 for task load-customer-data-init-sensor (Task received SIGTERM signal; 84)
[2022-06-17, 18:06:49 CST] {process_utils.py:70} INFO - Process psutil.Process(pid=84, status='terminated', exitcode=1, started='17:56:48') (84) terminated with exit code 1

Did you try the earlier suggestions with dagrun_timeout? Do you know what is sending SIGTERM to this task?

kcphila commented 2 years ago

Hi all,

From the discussion over at issue #17507, I may have identified the issue in the case where the SIGTERM is sent with the Recorded pid <> does not match the current pid <> error, but I'm running the LocalExecutor and not Kubernetes.

For me, I think this is happening when RUN_AS_USER is set for a task and the heartbeat check runs while the task instance pid is not set (None). In these cases, the recorded_pid gets set to the parent of the running task supervisor process, which is the Executor itself, instead of the task runner.

I don't know if this will address the issue with the Kubernetes or Celery executor, but it seems very likely to be the same issue. It will take me a little while to set up the dev environment and do the testing before submitting a PR, but if you want to try doing a local install, feel free to give it a whirl. I have a tentative branch set up here: https://github.com/krcrouse/airflow/tree/fix-pid-check
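
Purely as a simplified illustration of the failure mode described above (this is not the actual Airflow source), the check roughly amounts to comparing a recorded PID against the runner's PID, where run_as_user makes the "recorded" side come from a parent process:

import psutil

def simplified_heartbeat_pid_check(task_instance_pid, runner_pid, run_as_user_set):
    # Illustrative only. Per the description above: with run_as_user set, the
    # recorded PID is derived from the parent of the task process; if the task
    # instance PID was captured at the wrong moment, that parent ends up being
    # the executor rather than the task runner, the comparison fails, and the
    # task is sent SIGTERM.
    recorded_pid = (
        psutil.Process(task_instance_pid).ppid() if run_as_user_set else task_instance_pid
    )
    if recorded_pid != runner_pid:
        raise RuntimeError(
            f"Recorded pid {recorded_pid} does not match the current pid {runner_pid}"
        )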

allenhaozi commented 2 years ago

I have the same problem. I'm using Airflow 2.2.5 with SparkKubernetesOperator and SparkKubernetesSensor. The driver is running, but the sensor displays the following logs until the number of retries exceeds the threshold:

[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:104} INFO - Poking: load-customer-data-init-1655486757.7793136
[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:124} INFO - Spark application is still in state: RUNNING
[2022-06-17, 18:06:49 CST] {local_task_job.py:211} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2022-06-17, 18:06:49 CST] {process_utils.py:120} INFO - Sending Signals.SIGTERM to group 84. PIDs of all processes in the group: [84]
[2022-06-17, 18:06:49 CST] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 84
[2022-06-17, 18:06:49 CST] {taskinstance.py:1430} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-06-17, 18:06:49 CST] {taskinstance.py:1774} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/sensors/base.py", line 249, in execute
    time.sleep(self._get_next_poke_interval(started_at, run_duration, try_number))
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1432, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2022-06-17, 18:06:49 CST] {taskinstance.py:1278} INFO - Marking task as FAILED. dag_id=salesforecast-load-init, task_id=load-customer-data-init-sensor, execution_date=20220617T172033, start_date=20220617T175649, end_date=20220617T180649
[2022-06-17, 18:06:49 CST] {standard_task_runner.py:93} ERROR - Failed to execute job 24 for task load-customer-data-init-sensor (Task received SIGTERM signal; 84)
[2022-06-17, 18:06:49 CST] {process_utils.py:70} INFO - Process psutil.Process(pid=84, status='terminated', exitcode=1, started='17:56:48') (84) terminated with exit code 1

Did you try the earlier suggestions with dagrun_timeout? Do you know what is sending SIGTERM to this task?

Thank you @potiuk. I tried the dagrun_timeout parameter and it didn't work, but in my environment I commented out these three parameters and it works fine for now:

  1. AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC: 600
  2. AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 200
  3. AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: 600

airflow:
  config:
    # if other ns, u should config a new sa
    AIRFLOW__KUBERNETES__NAMESPACE: "airflow"
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"
    AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC: "15"
    AIRFLOW__LOGGING__LOGGING_LEVEL: "DEBUG"
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "s3://airflow-logs/"
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "openaios_airflow_log"
    AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
    #AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC: 600
    #AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 200
    #AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: 600
    AIRFLOW__KUBERNETES__WORKER_PODS_QUEUED_CHECK_INTERVAL: "86400"
    AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"
    AIRFLOW__CORE__HOSTNAME_CALLABLE: socket.gethostname
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "30"
    AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"

  ## a list of users to create
potiuk commented 2 years ago

I tried the dagrun_timeout parameter and it didn't work, but in my environment I commented out these three parameters and it works fine for now

The change indicates a problem with the scheduler healthcheck, which I believe has already been addressed in 2.3.* (we are currently voting on 2.3.3). I will close this provisionally. And I have a big request: can any of the people who had the problem migrate to 2.3.3 (or even try the 2.3.3rc1 which we are testing here in #24806) and reset the configuration to the defaults? (@allenhaozi - maybe you can try it.)

bdsoha commented 2 years ago

@potiuk I am on version 2.3.3 and am having the same issue described here.

potiuk commented 2 years ago

@potiuk I am on version 2.3.3 and am having the same issue described here.

Then provide information: logs, analysis, and a description of your circumstances in a separate issue. It does not bring anyone closer to a fix to state "I have the same issue" without providing any more details that can help with diagnosing the problem you have. This might be a different issue manifesting similarly - but if you do not create a new issue with your symptoms and description, you pretty much remove the chance of anyone fixing your problem, because it might be a different one. So if you want to help with the diagnosis of the problem, please do your part and report details that might help with the diagnosis.

yannibenoit commented 1 year ago

@potiuk I'm on version 2.3.4, and I'm getting this issue on an existing DAG that was working fine before with an older version (2.1.X) 🤷‍♂️

I tried updating the following variables and I still have the issue:

AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: 'False'
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "30"
AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"

and also tried with dagrun_timeout=timedelta(minutes=120)

I don't understand what I'm doing wrong because other dags work fine 😢

Any clue 🙏 ?

dag_id=singer_wootric_stitch_run_id=manual__2022-11-18T09_14_57.025351+00_00_task_id=bash_create_and_init_wootric_venv_attempt=16.log

potiuk commented 1 year ago

I suggest migrating to the latest version - 2.4 (or, in a few days, 2.5). There are hundreds of related fixes since then, and it is the easiest way to see if things have improved. This is the most efficient way for everyone.

yannibenoit commented 1 year ago

@potiuk after migrating to 2.5.0 I still get the issue.

potiuk commented 1 year ago

Can you please open a new issue with a description of the circumstances and logs describing when and how it happens?

That ask from above does not change:

Then provide information: logs, analysis, and a description of your circumstances in a separate issue. It does not bring anyone closer to a fix to state "I have the same issue" without providing any more details that can help with diagnosing the problem you have. This might be a different issue manifesting similarly - but if you do not create a new issue with your symptoms and description, you pretty much remove the chance of anyone fixing your problem, because it might be a different one. So if you want to help with the diagnosis of the problem, please do your part and report details that might help with the diagnosis.

potiuk commented 1 year ago

cc: @yannibenoit ^^

yannibenoit commented 1 year ago

@potiuk Thank you for your help

I created an issue but I will resolve it haha 😂 -> Tasks intermittently gets terminated with SIGTERM on Celery Executor · Issue #27885 · apache/airflow

Found a fix after looking at a stack overflow post -> Celery Executor - Airflow Impersonation "run_as_user" Recorded pid xxx does not match the current pid - Stack Overflow

I was running my BashOperator with run_as_user=airflow; I think I don't need it anymore.
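
For context, a hedged sketch of what that workaround looks like (the DAG id and task id are taken from the attached log file name above; the bash command and schedule are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="singer_wootric_stitch",                      # from the attached log name
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,                              # placeholder schedule
    catchup=False,
) as dag:
    BashOperator(
        task_id="bash_create_and_init_wootric_venv",     # from the attached log name
        bash_command="echo 'create and init venv here'",  # placeholder command
        # run_as_user="airflow",  # dropping this avoided the pid-mismatch SIGTERM
    )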

potiuk commented 1 year ago

Ah. I would say that should have been fixed already. Is it possible @yannibenoit - to make an issue and submit some logs from BEFORE the run_as_user was commented out? I guess this might be a problem others might also have and run_as_user is kinda useful.

shaurya-sood commented 1 year ago

Hello, we were experiencing a similar issue on v2.2.5 so we migrated to v2.4.3 but the problem still exists.

[2022-12-07, 15:37:49 UTC] {local_task_job.py:223} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2022-12-07, 15:37:49 UTC] {process_utils.py:133} INFO - Sending Signals.SIGTERM to group 89412. PIDs of all processes in the group: [89412]
[2022-12-07, 15:37:49 UTC] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 89412
[2022-12-07, 15:37:49 UTC] {taskinstance.py:1562} ERROR - Received SIGTERM. Terminating subprocesses.

The scheduler_heartbeat metric drops to almost 0 during the same time.

We're using a Postgres DB, and during DAG execution the CPU utilization of the DB spikes up to 100% (we're using a db.r6g.large RDS instance, btw).

potiuk commented 1 year ago

@shaurya-sood - can you please (asking it again) open a new issue with more details: what your deployment is, what you are doing, what you experience, more logs, what happens in the UI, whether you use run_as_user, whether it happens always or only sometimes, when it happens, etc.? It really does not help to add a comment on a closed issue that might just have a similar message but might not necessarily be the same issue.

Thanks in advance.

shaurya-sood commented 1 year ago

@shaurya-sood - can you please (asking it again) open a new issue with more details: what your deployment is, what you are doing, what you experience, more logs, what happens in the UI, whether you use run_as_user, whether it happens always or only sometimes, when it happens, etc.? It really does not help to add a comment on a closed issue that might just have a similar message but might not necessarily be the same issue.

Thanks in advance.

Opened a new issue https://github.com/apache/airflow/issues/28201 Thanks.