apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Intermittent SIGTERM running on K8S #32533

Closed mschueler closed 1 year ago

mschueler commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

We are seeing intermittent SIGTERMs on DAGs. There seems to be no rhyme or reason to them: it happens to all of our DAGs at one time or another, and there is no pattern to the timing.

The deployment is via the Helm chart to an EKS cluster, and it's happening in both our nonprod and prod clusters. We've tried different things in our nonprod environment to fix it, basically following ideas we found from Google searches and other GitHub issues: increasing resources, upgrading the Airflow version, checking logs (we've found nothing useful in them, but will post as much info as I can here), increasing timeouts, and trying settings such as the one below.

- name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME
  value: "3600"

Nonprod: EKS 1.26 / Airflow 2.5.1.

Prod: EKS 1.25 / Airflow 2.2.4

We're focusing our efforts on nonprod, but I just wanted to mention we're seeing the issue on multiple versions. Also, I believe the original version we started on was 2.0.x, and we've been struggling with this issue since January (when we first started to set up Airflow 2.0 on k8s). As a workaround we are adding retries where possible.
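
To illustrate the workaround, this is roughly what we do: set retries in a DAG's default_args so a SIGTERM'd task gets retried automatically. Minimal sketch only -- the DAG id, schedule, and retry values below are made up for this issue, not one of our real DAGs:

from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_with_retries",  # hypothetical DAG id
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval="@hourly",
    catchup=False,
    # Retry tasks automatically so an intermittent SIGTERM does not fail the run outright.
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    BashOperator(task_id="do_work", bash_command="echo work")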

This is the exact error:

airflow.exceptions.AirflowException: Task received SIGTERM signal
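
Where retries are not possible, the main extra signal I can think to capture is which worker pod the task was on when it died. A rough sketch of the kind of on_failure_callback that could be attached through default_args (the function name and log wording are mine, not an Airflow API):

import logging

log = logging.getLogger(__name__)

def log_failure_context(context):
    # Record which pod/host the task instance was running on when it failed.
    ti = context["task_instance"]
    log.error(
        "Task %s.%s (try %s) failed on host %s",
        ti.dag_id,
        ti.task_id,
        ti.try_number,
        ti.hostname,
    )

# Attached per DAG, e.g. default_args={"on_failure_callback": log_failure_context}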

Would truly appreciate any help or insight into what we're doing wrong. I've tried to put as much information below as possible but if I'm missing something, please let me know.

Helm values file:

# User and group of airflow user
airflowHome: /opt/airflow
airflowPodAnnotations:
  ad.datadoghq.com/tolerate-unready: "true"
  ad.datadoghq.com/webserver.check_names: '["airflow"]'
  ad.datadoghq.com/webserver.init_configs: "[{}]"
  ad.datadoghq.com/webserver.instances: '[{"url": "https://airflow.dev.eks.xxxx.com"}]'
  ad.datadoghq.com/webserver.logs: '[{"source":"airflow", "service": "airflow"}]'

defaultAirflowRepository: apache/airflow
defaultAirflowTag: 2.5.2
airflowVersion: 2.5.2

##########################################
## COMPONENT | Airflow images and gitsync
##########################################
images:
  airflow:
    pullPolicy: IfNotPresent
    repository: 007601687147.dkr.ecr.us-east-1.amazonaws.com/airflow
    tag: "33-dev"
  gitSync:
    pullPolicy: IfNotPresent
    repository: k8s.gcr.io/git-sync/git-sync
    tag: v3.3.0
  pgbouncer:
    pullPolicy: IfNotPresent
    repository: apache/airflow
    tag: airflow-pgbouncer-2021.04.28-1.14.0
  pgbouncerExporter:
    pullPolicy: IfNotPresent
    repository: apache/airflow
    tag: airflow-pgbouncer-exporter-2021.09.22-0.12.0
  pod_template:
    pullPolicy: IfNotPresent
    repository: null
    tag: null
  statsd:
    pullPolicy: IfNotPresent
    repository: apache/airflow
    tag: airflow-statsd-exporter-2021.04.28-v0.17.0

##########################################
## COMPONENT | Load balancer configs
##########################################
ingress:
  enabled: true
  web:
    annotations:
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/success-codes: 200,302
      alb.ingress.kubernetes.io/inbound-cidrs: 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
      alb.ingress.kubernetes.io/tags: environment=dev,team=techopsdevops@xxxx.com,business_app=eks-cluster,Name=airflow-ingress
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:xxxx:certificate/a26e6adb-9e75-4048-baa1-8ae08e2f8dd4
    path: "/*"
    pathType: "ImplementationSpecific"
    hosts:
      - airflow.dev.eks.xxxxx.com
    precedingPaths:
      - path: "/*"
        serviceName: "ssl-redirect"
        servicePort: "use-annotation"
        pathType: "ImplementationSpecific"
    succeedingPaths: []
    tls:
      enabled: false
      secretName: ""

# `airflow_local_settings` file as a string (can be templated).
airflowLocalSettings: null

# Enable RBAC (default on most clusters these days)
rbac:
  create: true

# Airflow executor
executor: KubernetesExecutor

# Environment variables for all airflow containers
env:
  - name: AIRFLOW__LOGGING__FAB_LOGGING_LEVEL
    value: DEBUG

allowPodLaunching: true

# Custom secrets
extraSecrets:
  airflow-ssh-secret:
    data: |
      gitSshKey: 'xxxx'
  airflow-db:
    data: |
      connection: 'xxxxxx'

# Extra env 'items' that will be added to the definition of airflow containers
extraEnv: |-
  - name: AIRFLOW__METRICS__STATSD_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: AWS_DEFAULT_REGION
    value: us-east-1
  - name: AIRFLOW__LOGGING__FAB_LOGGING_LEVEL
    value: INFO
  - name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME
    value: "3600"
  - name: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
    value: "false"

# Airflow database config
data:
  metadataSecretName: airflow-db

# Fernet key settings
# Note: fernetKey can only be set during install, not upgrade
fernetKey: null
fernetKeySecretName: null

###################################
## COMPONENT | Airflow Workers
###################################
workers:
  persistence:
    enabled: false
    fixPermissions: false
  nodeSelector:
    node.kubernetes.io/instance-type: c5d.2xlarge
  tolerations:
    - effect: NoSchedule
      key: allowed_jobs
      value: airflow
      operator: Equal
  resources:
    limits:
      memory: 4000Mi
    requests:
      memory: 4000Mi
  replicas: 1
  safeToEvict: true
  serviceAccount:
    create: true
    name: null
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::xxxxx:role/airflow-eks-devops-dev-s3-irsa
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 50%
  updateStrategy: null
  extraVolumes:
    - name: temp-worker-data
      persistentVolumeClaim:
        claimName: airflow-temp-workers-efs-claim
  extraVolumeMounts:
    - name: temp-worker-data
      mountPath: /opt/airflow/worker_data/

###################################
## COMPONENT | Airflow Scheduler
###################################
scheduler:
  livenessProbe:
    failureThreshold: 5
    initialDelaySeconds: 10
    periodSeconds: 60
    timeoutSeconds: 20
  nodeSelector:
    node.kubernetes.io/instance-type: t3.2xlarge
  podDisruptionBudget:
    config:
      maxUnavailable: 1
    enabled: false
  replicas: 1
  safeToEvict: true
  serviceAccount:
    create: true
    name: null
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::xxxxx:role/airflow-eks-devops-dev-s3-irsa
  extraVolumes:
    - name: temp-worker-data
      persistentVolumeClaim:
        claimName: airflow-temp-workers-efs-claim
  extraVolumeMounts:
    - name: temp-worker-data
      mountPath: /opt/airflow/worker_data/

###################################
## COMPONENT | Airflow Webserver
###################################
webserver:
  allowPodLogReading: true
  defaultUser:
    email: admin@xxx.com
    enabled: true
    firstName: admin1
    lastName: user
    password: xxxx
    role: Admin
    username: admin
  livenessProbe:
    initialDelaySeconds: 15
    timeoutSeconds: 30
    failureThreshold: 20
    periodSeconds: 5
  readinessProbe:
    failureThreshold: 20
    initialDelaySeconds: 15
    periodSeconds: 5
    timeoutSeconds: 30
  serviceAccount:
    create: true
    name: ~
    annotations:
  replicas: 2
  service:
    type: ClusterIP
    ports:
      - name: airflow-ui
        port: 80
        targetPort: airflow-ui
  strategy: null
  webserverConfig: |-
    import os
    from airflow import configuration as conf
    from flask_appbuilder.security.manager import AUTH_LDAP

    CSRF_ENABLED = True
    AUTH_TYPE = AUTH_LDAP
    AUTH_LDAP_SERVER = "ldap://ldap-aws.xxxx.com:389"
    AUTH_LDAP_USE_TLS = False

    AUTH_USER_REGISTRATION = True
    AUTH_USER_REGISTRATION_ROLE = "Admin"
    AUTH_LDAP_FIRSTNAME_FIELD = "givenName"
    AUTH_LDAP_LASTNAME_FIELD = "sn"
    AUTH_LDAP_EMAIL_FIELD = "mail"

    AUTH_LDAP_USERNAME_FORMAT = "xxx"
    AUTH_LDAP_SEARCH = "xxxx"
    AUTH_LDAP_UID_FIELD = "SamAccountName"
    AUTH_LDAP_SEARCH_FILTER = "xxxxxxx"
    AUTH_LDAP_GROUP_FIELD = "memberOf"

    AUTH_ROLES_SYNC_AT_LOGIN = True
    PERMANENT_SESSION_LIFETIME = 600

# Overriding airflow flower
flower:
  enabled: false

# Overriding airflow statsd
statsd:
  enabled: false

##########################################
## COMPONENT | PgBouncer
##########################################
pgbouncer:
  ciphers: normal
  configSecretName: null
  enabled: true
  logConnections: 1
  logDisconnections: 1
  maxClientConn: 100
  metadataPoolSize: 10
  podDisruptionBudget:
    config:
      maxUnavailable: 1
    enabled: false
  resultBackendPoolSize: 5
  serviceAccount:
    create: true
    name: null
  ssl:
    ca: null
    cert: null
    key: null
  sslmode: prefer

# Overriding redis config
redis:
  enabled: false

# All ports used by chart
ports:
  airflowUI: 8080
  pgbouncer: 6543
  pgbouncerScrape: 9127
  statsdIngest: 8125
  workerLogs: 8793

# This runs as a CronJob to cleanup old pods
cleanup:
  enabled: true
  schedule: "*/15 * * * *"
  serviceAccount:
    create: true
    name: airflow

# Overriding postgres config
postgresql:
  enabled: false

# Config settings to go into the mounted airflow.cfg
config:
  core:
    load_examples: "False"
    load_default_connections: "False"
    parallelism: 300
    default_pool_task_slot_count: 300
    max_active_tasks_per_dag: 100
    max_active_runs_per_dag: 1
    executor: "{{ .Values.executor }}"
    remote_logging: "True"
    dagbag_import_timeout: 60
  email:
    email_backend: airflow.utils.email.send_email_smtp
  smtp:
    smtp_host: mailhost.dynata.com
    smtp_starttls: False
    smtp_ssl: False
    smtp_port: 25
    smtp_mail_from: airflow@dynata.com
  logging:
    colored_console_log: "False"
    remote_logging: "True"
    remote_base_log_folder: s3://airflow-dev-eks
    remote_log_conn_id: aws_default
  metrics:
    statsd_on: true
    statsd_port: 8125
    statsd_prefix: airflow
  webserver:
    base_url: https://airflow.dev.eks.dynata.com
  scheduler:
    enable_health_check: "True"
  kubernetes:
    worker_pods_creation_batch_size: 100

# Overriding pod template
podTemplate: null

###################################
## COMPONENT | Airflow dags
###################################
dags:
  gitSync:
    # branch: airflow-v2-testin
    branch: main
    containerName: git-sync
    depth: 1
    enabled: true
    env: []
    extraVolumeMounts: []
    maxFailures: 0
    repo: git@github.com:dynata/airflow.git
    rev: HEAD
    sshKeySecret: airflow-ssh-secret
    # subPath: dags_v2
    subPath: dags
    uid: 50000
    wait: 30
  persistence:
    enabled: false

# Overriding logs
logs:
  persistence:
    enabled: false

Resulting airflow.cfg (from the ConfigMap):

[celery]
flower_url_prefix = /
worker_concurrency = 16

[celery_kubernetes_executor]
kubernetes_queue = kubernetes

[core]
colored_console_log = False
dagbag_import_timeout = 60
dags_folder = /opt/airflow/dags/repo/dags
default_pool_task_slot_count = 300
executor = KubernetesExecutor
load_default_connections = False
load_examples = False
max_active_runs_per_dag = 1
max_active_tasks_per_dag = 100
parallelism = 300
remote_logging = True

[elasticsearch]
json_format = True
log_id_template = {dag_id}_{task_id}_{execution_date}_{try_number}

[elasticsearch_configs]
max_retries = 3
retry_timeout = True
timeout = 30

[email]
email_backend = airflow.utils.email.send_email_smtp

[kerberos]
ccache = /var/kerberos-ccache/cache
keytab = /etc/airflow.keytab
principal = airflow@FOO.COM
reinit_frequency = 3600

[kubernetes]
airflow_configmap = airflow-airflow-config
airflow_local_settings_configmap = airflow-airflow-config
multi_namespace_mode = False
namespace = data-platform
pod_template_file = /opt/airflow/pod_templates/pod_template_file.yaml
worker_container_repository = xxxxx.dkr.ecr.us-east-1.amazonaws.com/airflow
worker_container_tag = 33-dev
worker_pods_creation_batch_size = 100

[logging]
colored_console_log = False
remote_base_log_folder = s3://airflow-dev-xxxx
remote_log_conn_id = aws_default
remote_logging = True

[metrics]
statsd_host = airflow-statsd
statsd_on = true
statsd_port = 8125
statsd_prefix = airflow

[scheduler]
enable_health_check = True
run_duration = 41460
standalone_dag_processor = False
statsd_host = airflow-statsd
statsd_on = False
statsd_port = 9125
statsd_prefix = airflow

[smtp]
smtp_host = mailhost.xxx.com
smtp_mail_from = airflow@xxx.com
smtp_port = 25
smtp_ssl = false
smtp_starttls = false

[webserver]
base_url = https://airflow.dev.eks.xxxx.com
enable_proxy_fix = True
rbac = True

What you think should happen instead

No response

How to reproduce

Intermittent. Schedule a DAG run.

Operating System

Kubernetes -- DAGs running on an image based on Debian Bullseye

Versions of Apache Airflow Providers

apache-airflow-providers-amazon 7.3.0 Amazon integration (including Amazon Web Services (AWS)).
apache-airflow-providers-celery 3.1.0 Celery
apache-airflow-providers-cncf-kubernetes 5.2.2 Kubernetes
apache-airflow-providers-common-sql 1.3.4 Common SQL Provider
apache-airflow-providers-datadog 2.0.4 Datadog
apache-airflow-providers-docker 3.5.1 Docker
apache-airflow-providers-elasticsearch 4.4.0 Elasticsearch
apache-airflow-providers-ftp 3.3.1 File Transfer Protocol (FTP)
apache-airflow-providers-google 8.11.0 Google services including: - Google Ads - Google Cloud (GCP) - Google Firebase - Google LevelDB - Google Marketing Platform - Google Workspace (formerly Google Suite)
apache-airflow-providers-grpc 3.1.0 gRPC
apache-airflow-providers-hashicorp 3.3.0 Hashicorp including Hashicorp Vault
apache-airflow-providers-http 4.2.0 Hypertext Transfer Protocol (HTTP)
apache-airflow-providers-imap 3.1.1 Internet Message Access Protocol (IMAP)
apache-airflow-providers-microsoft-azure 5.2.1 Microsoft Azure
apache-airflow-providers-microsoft-mssql 2.1.3 Microsoft SQL Server (MSSQL)
apache-airflow-providers-mysql 2.2.3 MySQL
apache-airflow-providers-odbc 3.2.1 ODBC
apache-airflow-providers-oracle 2.2.3 Oracle
apache-airflow-providers-postgres 5.4.0 PostgreSQL
apache-airflow-providers-redis 3.1.0 Redis
apache-airflow-providers-sendgrid 3.1.0 Sendgrid
apache-airflow-providers-sftp 2.6.0 SSH File Transfer Protocol (SFTP)
apache-airflow-providers-slack 7.2.0 Slack
apache-airflow-providers-snowflake 2.1.1 Snowflake
apache-airflow-providers-sqlite 3.3.1 SQLite
apache-airflow-providers-ssh 3.5.0 Secure Shell (SSH)
apache-airflow-providers-tableau 2.1.8 Tableau


Deployment

Official Apache Airflow Helm Chart

Deployment details

EKS 1.25 running Karpenter (cluster autoscaler replacement)

Anything else

Intermittent -- anywhere from 5 to 50 times a day

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.