Nimesh-K-Makwana closed this issue 2 years ago.
We are also facing the same problem, with the stack trace below:
ERROR - Received SIGTERM. Terminating subprocesses.
ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/airflow/dags/flow_airflow/operator/dts_operator.py", line 100, in execute
    self.to_email_address)
  File "/usr/local/airflow/dags/flow_airflow/operator/dts_operator.py", line 111, in run_process
    for line in iter(process.stdout.readline, ""):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1237, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
WARNING - process psutil.Process(pid=25806, name='python3', status='zombie', started='20:09:10') did not respond to SIGTERM. Trying SIGKILL
{process_utils.py:124} ERROR - Process psutil.Process(pid=54556, name='python3', status='zombie', started='20:09:10') (54556) could not be killed. Giving up.
We also have thousands of tasks, and this happens intermittently to some of them.
@GHGHGHKO thanks for the reply. We are seeing the issue on our task pods, which end up in a Failed state in Kubernetes even after the task succeeds, so the result is Error pods in k8s. We control the pod resources using request_memory, limit_memory and limit_cpu. Are you suggesting we increase each task's limit to 4GB? That would be huge, and our cluster cannot provide that amount of resources because we have a lot of tasks that run in parallel. Please let me know.
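For reference, this is roughly what per-task pod resources look like with the KubernetesExecutor using a pod_override in executor_config; the DAG id, task and resource values below are placeholders for illustration, not a recommendation from this thread:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG("resource_limited_example", start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False) as dag:
    # Request modest resources per task and cap the limit, instead of a blanket 4GB everywhere.
    BashOperator(
        task_id="bounded_task",
        bash_command="echo hello",
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # "base" is the task container name used by the executor
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "500m", "memory": "512Mi"},
                                limits={"cpu": "1", "memory": "1Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )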
Possible fix
I was having the same problem after upgrading from Airflow v1.10.15 to v2.2.5 and was seeing the error in long-running DAGs with a fairly high number of tasks.
Apparently the dagrun_timeout in airflow.models.DAG was not respected in earlier Airflow versions, and I noticed that the DAGs I was trying to migrate to the new Airflow instance were running for much longer than the specified dagrun_timeout.
The solution for me was to increase the dagrun_timeout (e.g. dagrun_timeout=datetime.timedelta(minutes=120)).
Note that this setting is effective only for scheduled runs (in other words, for DAGs with a specified schedule_interval).
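For anyone wanting to see the shape of that change, here is a minimal sketch; the DAG id, schedule and timeout value are illustrative only:

import datetime
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="long_running_example",
    start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
    schedule_interval="@daily",  # dagrun_timeout only applies to scheduled runs
    dagrun_timeout=datetime.timedelta(minutes=120),  # set this above the real run duration
    catchup=False,
) as dag:
    BashOperator(task_id="do_work", bash_command="sleep 60")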
I have the same problem. I'm using Airflow 2.2.5 with SparkKubernetesOperator and SparkKubernetesSensor.
The driver is running, but the sensor displays the following logs until the number of retries exceeds the threshold:
[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:104} INFO - Poking: load-customer-data-init-1655486757.7793136
[2022-06-17, 18:05:52 CST] {spark_kubernetes.py:124} INFO - Spark application is still in state: RUNNING
[2022-06-17, 18:06:49 CST] {local_task_job.py:211} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2022-06-17, 18:06:49 CST] {process_utils.py:120} INFO - Sending Signals.SIGTERM to group 84. PIDs of all processes in the group: [84]
[2022-06-17, 18:06:49 CST] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 84
[2022-06-17, 18:06:49 CST] {taskinstance.py:1430} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-06-17, 18:06:49 CST] {taskinstance.py:1774} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/sensors/base.py", line 249, in execute
time.sleep(self._get_next_poke_interval(started_at, run_duration, try_number))
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1432, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2022-06-17, 18:06:49 CST] {taskinstance.py:1278} INFO - Marking task as FAILED. dag_id=salesforecast-load-init, task_id=load-customer-data-init-sensor, execution_date=20220617T172033, start_date=20220617T175649, end_date=20220617T180649
[2022-06-17, 18:06:49 CST] {standard_task_runner.py:93} ERROR - Failed to execute job 24 for task load-customer-data-init-sensor (Task received SIGTERM signal; 84)
[2022-06-17, 18:06:49 CST] {process_utils.py:70} INFO - Process psutil.Process(pid=84, status='terminated', exitcode=1, started='17:56:48') (84) terminated with exit code 1
Did you try the earlier suggestions with dagrun_timeout? Do you know what is sending SIGTERM to this task?
Hi all,
From the discussion over at issue 17507, I may have identified the issue in which the SIGTERM is sent together with the Recorded pid <> does not match the current pid <> error, but I'm running the LocalExecutor and not Kubernetes.
For me, I think this is happening when run_as_user is set for a task and the heartbeat is checked while the task instance pid is not yet set (None). In that case, the recorded_pid gets set to the parent of the running task supervisor process, which is the executor itself, instead of the task runner.
I don't know if this will address the issue with kubernetes or celery executor, but it seems very likely to be the same issue. It will take me a little while to set up the dev environment and do the testing before submitting a PR, but if you want to try doing a local install, feel free to give it a whirl. I have a tentative branch set up here: https://github.com/krcrouse/airflow/tree/fix-pid-check
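To make the described failure mode concrete, here is a simplified illustration of the kind of pid comparison being talked about; this is not Airflow's actual code, and the helper below is hypothetical:

from typing import Optional
import psutil

def heartbeat_pid_check(recorded_pid: Optional[int], task_runner_pid: int, run_as_user: Optional[str]) -> None:
    # Hypothetical sketch: when run_as_user is set, the task runner is re-executed under a
    # sudo wrapper, so the pid registered for the task instance belongs to the parent process.
    current_pid = task_runner_pid
    if run_as_user:
        current_pid = psutil.Process(task_runner_pid).ppid()
    # If the recorded pid was captured while the task instance pid was still None, it can end up
    # pointing at the executor process instead of the task runner; the check then fails with
    # "Recorded pid ... does not match the current pid ..." and the task receives a SIGTERM.
    if recorded_pid is not None and recorded_pid != current_pid:
        raise RuntimeError(
            f"Recorded pid {recorded_pid} does not match the current pid {current_pid}"
        )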
Did you try the earlier suggestions with dagrun_timeout? Do you know what is sending SIGTERM to this task?
thank you @potiuk
I tried the dagrun_timeout parameter and it didn't work, but in my environment I commented out these three parameters and it works fine for now:
airflow:
config:
# if other ns, u should config a new sa
AIRFLOW__KUBERNETES__NAMESPACE: "airflow"
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"
AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC: "15"
AIRFLOW__LOGGING__LOGGING_LEVEL: "DEBUG"
AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "s3://airflow-logs/"
AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "openaios_airflow_log"
AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
#AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC: 600
#AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 200
#AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: 600
AIRFLOW__KUBERNETES__WORKER_PODS_QUEUED_CHECK_INTERVAL: "86400"
AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"
AIRFLOW__CORE__HOSTNAME_CALLABLE: socket.gethostname
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "30"
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"
The change indicates a problem with the scheduler healthcheck, which I believe was already addressed in 2.3.* (currently we are voting on 2.3.3). I will close this provisionally. And I have a big request: can any of the people who had the problem migrate to 2.3.3 (or even try the 2.3.3rc1 which we are testing here #24806) and reset the configuration to the defaults? (@allenhaozi - maybe you can try it.)
@potiuk I am on version 2.3.3 and am having the same issue described here.
Then provide the information: logs, analysis and a description of your circumstances in a separate issue. It does not bring anyone closer to a fix to state "I have the same issue" without providing any more details that can help with diagnosing the problem you have. This might be a different issue manifesting similarly, but if you do not create a new issue with your symptoms and description, you pretty much remove the chance of anyone fixing your problem, because it might be a different one. So if you want to help with the diagnosis, please do your part and report details that might help.
@potiuk I'm on version 2.3.4 and I got the issue on an existing DAG that was working fine before with an older version (2.1.X) 🤷♂️
I tried to update the following variables and I still have the issue:
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: 'False'
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "30"
AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"
and also tried with dagrun_timeout=timedelta(minutes=120)
I don't understand what I'm doing wrong because other dags work fine 😢
I suggest migrating to the latest version - 2.4 (or, in a few days, 2.5). There are hundreds of related fixes since then, and it is the easiest way to see if things got better. This is the most efficient way for everyone.
@potiuk after migrating to 2.5.0 I still get the issue.
Can you please open a new issue with a description of the circumstances and logs describing when and how it happens?
That ask from above does not change: please provide logs, analysis and a description of your circumstances in a separate issue, with details that can help with the diagnosis.
cc: @yannibenoit ^^
@potiuk Thank you for your help
I created an issue but I will resolve it myself haha 😂 -> Tasks intermittently gets terminated with SIGTERM on Celery Executor · Issue #27885 · apache/airflow
Found a fix after looking at a Stack Overflow post -> Celery Executor - Airflow Impersonation "run_as_user" Recorded pid xxx does not match the current pid - Stack Overflow
I was running my BashOperator with run_as_user=airflow; I think I don't need it anymore.
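For context, a task with impersonation enabled looks roughly like the sketch below (the DAG id and command are made up for illustration); removing run_as_user makes the task run as the worker's own user:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("impersonation_example", start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False) as dag:
    # run_as_user makes Airflow re-exec the task as another OS user via sudo;
    # this is the impersonation path tied to the "Recorded pid does not match" reports above.
    BashOperator(
        task_id="run_as_airflow_user",
        bash_command="whoami",
        run_as_user="airflow",
    )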
Ah. I would say that should have been fixed already. Is it possible @yannibenoit - to make an issue and submit some logs from BEFORE the run_as_user was commented out? I guess this might be a problem others might also have and run_as_user is kinda useful.
Hello, we were experiencing a similar issue on v2.2.5 so we migrated to v2.4.3 but the problem still exists.
[2022-12-07, 15:37:49 UTC] {local_task_job.py:223} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2022-12-07, 15:37:49 UTC] {process_utils.py:133} INFO - Sending Signals.SIGTERM to group 89412. PIDs of all processes in the group: [89412]
[2022-12-07, 15:37:49 UTC] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 89412
[2022-12-07, 15:37:49 UTC] {taskinstance.py:1562} ERROR - Received SIGTERM. Terminating subprocesses.
The scheduler_heartbeat metric drops to almost 0 during the same period.
We're using a Postgres DB, and during the DAG execution the CPU utilization of the DB spikes up to 100% (we're using a db.r6g.large RDS instance, btw).
@shaurya-sood - can you please (asking it again) open a new issue with more details: what your deployment is, what you are doing, what you experience, more logs, what happens in the UI, whether you use run_as_user, whether it happens always or only sometimes, when it happens, etc. It really does not help to add a comment on a closed issue that might just have a similar message but might not necessarily be the same issue.
Thanks in advance.
Opened a new issue https://github.com/apache/airflow/issues/28201 Thanks.
Apache Airflow version
2.1.3 (latest released)
Operating System
Linux
Versions of Apache Airflow Providers
No response
Deployment
Other
Deployment details
Have tried the env variables given in this GitHub issue, issues/14672:
AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"
What happened
[2021-09-04 10:28:50,536] {local_task_job.py:80} ERROR - Received SIGTERM. Terminating subprocesses
[2021-09-04 10:28:50,536] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 33
[2021-09-04 10:28:50,537] {taskinstance.py:1235} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-09-04 10:28:52,568] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1307, in _execute_task
    result = task_copy.execute(context=context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 150, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 161, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/repo/dags/elastit_schedular/waiting_task_processor.py", line 59, in trigger_task
    time.sleep(1)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1237, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
What you expected to happen
The DAG must be executed successfully without any SIGTERM signal.
How to reproduce
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct