Versions of Apache Airflow Providers
| Package Name | Version | Description |
|---|---|---|
| apache-airflow-providers-amazon | 7.3.0 | Amazon integration (including Amazon Web Services (AWS)). |
| apache-airflow-providers-celery | 3.1.0 | Celery |
| apache-airflow-providers-cncf-kubernetes | 5.2.2 | Kubernetes |
| apache-airflow-providers-common-sql | 1.3.4 | Common SQL Provider |
| apache-airflow-providers-datadog | 2.0.4 | Datadog |
| apache-airflow-providers-docker | 3.5.1 | Docker |
| apache-airflow-providers-elasticsearch | 4.4.0 | Elasticsearch |
| apache-airflow-providers-ftp | 3.3.1 | File Transfer Protocol (FTP) |
| apache-airflow-providers-google | 8.11.0 | Google services including Google Ads, Google Cloud (GCP), Google Firebase, Google LevelDB, Google Marketing Platform, and Google Workspace (formerly Google Suite) |
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
We are seeing intermittent SIGTERMs on DAGs. There seems to be no rhyme or reason to them: they hit all of our DAGs at one time or another, with no pattern to the timing.
We deploy with the Helm chart to an EKS cluster, and the problem occurs in both our nonprod and prod clusters. In nonprod we have tried several fixes based on ideas from Google searches: increasing resources, upgrading the Airflow version, checking logs (we've found nothing useful in them, but I will post as much info as I can here), increasing timeouts, and trying some settings* mentioned in other GitHub issues.
*Settings tried:
`- name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME`
`  value: "3600"`
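For anyone copying the setting above: Airflow reads configuration overrides from environment variables named `AIRFLOW__{SECTION}__{KEY}`, with double underscores that Markdown rendering tends to swallow. The small helper below is only a sketch of that naming rule; `airflow_env_var` and `helm_env_entry` are hypothetical names for illustration, not part of Airflow or the Helm chart.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name for an airflow.cfg option."""
    # Double underscores separate the prefix, the section, and the key.
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def helm_env_entry(section: str, key: str, value: str) -> dict:
    """Shape the override as a name/value pair, as in a Kubernetes env list."""
    return {"name": airflow_env_var(section, key), "value": str(value)}


entry = helm_env_entry("core", "killed_task_cleanup_time", "3600")
print(entry["name"], entry["value"])
```

Keeping the value a string matters in YAML, since Kubernetes env values must be strings.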
Nonprod: EKS 1.26 / Airflow 2.5.1.
Prod: EKS 1.25 / Airflow 2.2.4
We're focusing our efforts on nonprod, but wanted to mention that we see the issue on multiple versions. We believe the original version we started on was 2.0.x, and we've been struggling with this issue since January, when we first started to set up Airflow 2.0 on k8s. As a workaround, we retry tasks where possible.
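A rough sketch of what the retry workaround can look like, for readers in the same situation. The `retries`, `retry_delay`, `retry_exponential_backoff`, and `max_retry_delay` keys are standard operator arguments; the specific values here are assumptions for illustration, not our actual settings.

```python
from datetime import timedelta

# Illustrative values only (assumed, not taken from our deployment).
default_args = {
    "retries": 3,                              # re-run a killed task up to 3 times
    "retry_delay": timedelta(minutes=5),       # wait before the first retry
    "retry_exponential_backoff": True,         # lengthen the wait on each retry
    "max_retry_delay": timedelta(minutes=30),  # cap on the backoff
}

print(default_args["retries"], default_args["retry_delay"])
```

The dict would be passed as `default_args=default_args` when constructing the `DAG`, so every task inherits it. Retries only help for idempotent tasks, which is why we apply them "where possible".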
This is the exact error:
airflow.exceptions.AirflowException: Task received SIGTERM signal
Would truly appreciate any help or insight into what we're doing wrong. I've tried to put as much information below as possible but if I'm missing something, please let me know.
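For context on the error text: as far as I understand, the message is raised from a signal handler in the task process, so it indicates that something outside the task (for example the scheduler, the executor, or Kubernetes evicting the pod) sent SIGTERM. The snippet below is a simplified stand-in using only the standard library, not Airflow's code; `TaskTerminated` and `run_task_and_get_sigterm` are hypothetical names. It runs on POSIX systems and must be called from the main thread.

```python
import os
import signal
import time


class TaskTerminated(Exception):
    """Hypothetical stand-in for airflow.exceptions.AirflowException."""


def _on_sigterm(signum, frame):
    # Turn the signal into an exception so the running task code unwinds.
    raise TaskTerminated("Task received SIGTERM signal")


def run_task_and_get_sigterm():
    """Install the handler, send SIGTERM to this process, return the error text."""
    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        os.kill(os.getpid(), signal.SIGTERM)  # simulates an external kill
        time.sleep(5)  # stand-in for task work; interrupted by the handler
        return None
    except TaskTerminated as exc:
        return str(exc)
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


message = run_task_and_get_sigterm()
print(message)
```

The point of the sketch is that the exception says nothing about who sent the signal, which is why the task logs alone have not been enough to diagnose this.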
Helm values file:
Resulting airflow.cfg (ConfigMap):
What you think should happen instead
No response
How to reproduce
Intermittent. Schedule a DAG run.
Operating System
Kubernetes -- DAGs running on image based on Debian Bullseye
Versions of Apache Airflow Providers