apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Airflow webserver stops responding and is then restarted by liveness probes #22307

Closed hussainsaify closed 2 years ago

hussainsaify commented 2 years ago

Apache Airflow version

2.2.4 (latest released)

What happened

Hi Team,

After upgrading to airflow 2.2.3, we have started facing an issue where the webserver gets stuck and is eventually restarted after failing its liveness probes. The issue is transient; we see it 2-3 times per week, and we also see a high CPU spike while the issue is occurring.

Cloud: AWS EKS 1.21
Helm chart: 1.4.0
Current Airflow version: 2.2.4

Please let me know in case you need more details

Thanks, Hussain

What you think should happen instead

The webserver should continue running and respond to requests.

How to reproduce

The issue is transient, occurs 2-3 times per week.

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.0.0
apache-airflow-providers-apache-hive==2.2.0
apache-airflow-providers-celery==2.1.0
apache-airflow-providers-cncf-kubernetes==3.0.2
apache-airflow-providers-docker==2.4.1
apache-airflow-providers-elasticsearch==2.2.0
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-google==6.4.0
apache-airflow-providers-grpc==2.0.1
apache-airflow-providers-hashicorp==2.1.1
apache-airflow-providers-http==2.0.3
apache-airflow-providers-imap==2.2.0
apache-airflow-providers-microsoft-azure==3.6.0
apache-airflow-providers-microsoft-mssql==2.1.0
apache-airflow-providers-mysql==2.2.0
apache-airflow-providers-odbc==2.0.1
apache-airflow-providers-postgres==3.0.0
apache-airflow-providers-redis==2.0.1
apache-airflow-providers-sendgrid==2.0.1
apache-airflow-providers-sftp==2.4.1
apache-airflow-providers-slack==4.2.0
apache-airflow-providers-sqlite==2.1.0
apache-airflow-providers-ssh==2.4.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Please find the helm values below.

defaultAirflowTag: "2.2.4"

airflowVersion: "2.2.4"

labels:
  spotinst.io/restrict-scale-down: "true"
nodeSelector:
  node-class: worker

fernetKeySecretName: airflow-fernet-key

webserverSecretKeySecretName: airflow-webserver-secret-key

ingress:
  enabled: true
  web:
    host: "airflow.************.com"
    annotations:
      kubernetes.io/ingress.class: "nginx"
    hosts:
      - name: "airflow.**************.com"
        tls:
          enabled: true
          secretName: "tls-wildcard"

  # Configs for the Ingress of the flower Service
  flower:

    path: "/flower"

    pathType: "ImplementationSpecific"

    host: "airflow.************.com"
    annotations:
      kubernetes.io/ingress.class: "nginx"

    hosts:
      - name: "airflow.***********.com"
        tls:
          enabled: true
          secretName: "tls-wildcard"

airflowPodAnnotations:
  ad.datadoghq.com/airflow-web.check_names: '["airflow"]'
  ad.datadoghq.com/airflow-web.init_configs: '[{}]'
  ad.datadoghq.com/airflow-web.instances: |
    [
      {
        "url": "http://%%host%%:8080"
      }
    ]
executor: "KubernetesExecutor"

extraEnv: |
  - name: AIRFLOW__CORE__FERNET_KEY
    valueFrom:
      secretKeyRef:
        name: airflow-fernet-key
        key: fernet-key
  - name: AZURE_TENANT_ID
    valueFrom:
      secretKeyRef:
        name: airflow-azuread-creds
        key: tenant_id
  - name: AZURE_CLIENT_SECRET
    valueFrom:
      secretKeyRef:
        name: airflow-azuread-creds
        key: client_secret
  - name: AZURE_CLIENT_ID
    valueFrom:
      secretKeyRef:
        name: airflow-azuread-creds
        key: client_id
  - name: AIRFLOW__METRICS__STATSD_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
env:
  - name: AIRFLOW_CONN_AWS_LOG
    value: "aws://"
  - name: AIRFLOW__CORE__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER
    value: "s3://***********/airflow-etl/logs"
  - name: AIRFLOW__CORE__REMOTE_LOG_CONN_ID
    value: "aws_development"
  - name: AIRFLOW__CORE__ENCRYPT_S3_LOGS
    value: "True"
  - name: AIRFLOW__CORE__PARALLELISM
    value: "64"
  - name: AIRFLOW__CORE__DAG_CONCURRENCY
    value: "32"
  - name: AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG
    value: "32"
  - name: AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE
    value: "10"
  - name: AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW
    value: "30"
  - name: AIRFLOW__SMTP__SMTP_HOST
    value: "smtp.mail-relay.svc"
  - name: AIRFLOW__SMTP__SMTP_MAIL_FROM
    value: "***********"
  - name: AIRFLOW__METRICS__STATSD_ON
    value: "True"
  - name: AIRFLOW__METRICS__STATSD_PORT
    value: "8125"
  - name: AIRFLOW__METRICS__STATSD_PREFIX
    value: "airflow"
  - name: AIRFLOW__WEBSERVER__AUTHENTICATE
    value: "True"
  - name: AIRFLOW__WEBSERVER__EXPOSE_CONFIG
    value: "True"
  - name: AIRFLOW__WEBSERVER__RBAC
    value: "True"
  - name: AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX
    value: "True"
  - name: AIRFLOW__LOGGING__FAB_LOGGING_LEVEL
    value: "WARN"
  - name: AIRFLOW__LOGGING__LOGGING_LEVEL
    value: "DEBUG"
  - name: AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT
    value: "240.0"
  - name: AIRFLOW__CELERY__FLOWER_URL_PREFIX
    value: "/flower"
  - name: AIRFLOW__API__AUTH_BACKEND
    value: "airflow.api.auth.backend.basic_auth"
  - name: AIRFLOW__CORE__HOSTNAME_CALLABLE
    value: "socket:gethostname"
  - name: AIRFLOW__WEBSERVER__WORKER_REFRESH_BATCH_SIZE
    value: "0"
  - name: AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL
    value: "0"

data:
  metadataSecretName: airflow-metadata-connection

# Airflow scheduler settings
scheduler:
  replicas: 2
  serviceAccount:
    create: false
    name: airflow
  # Scheduler pod disruption budget
  podDisruptionBudget:
    enabled: true
    config:
      maxUnavailable: 1

  resources:
    limits:
      cpu: "4000m"
      memory: "8Gi"
    requests:
      cpu: "1000m"
      memory: "2Gi"

  nodeSelector:
    node-class: worker
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              component: scheduler
          topologyKey: kubernetes.io/hostname
        weight: 100

# Airflow webserver settings
webserver:

  replicas: 1

  serviceAccount:
    create: false
    name: airflow

  resources:
    limits:
      cpu: "4000m"
      memory: "8Gi"
    requests:
      cpu: "1500m"
      memory: "2Gi"

  webserverConfig: |
    import os
    from airflow.configuration import conf
    from flask_appbuilder.security.manager import AUTH_OAUTH
    from flask import Flask
    from flask_appbuilder import SQLA, AppBuilder
    SQLALCHEMY_DATABASE_URI = conf.get("core", "SQL_ALCHEMY_CONN")
    basedir = os.path.abspath(os.path.dirname(__file__))
    AUTH_USER_REGISTRATION_ROLE = "Viewer"
    AUTH_TYPE = AUTH_OAUTH
    AUTH_ROLES_SYNC_AT_LOGIN = True
    AUTH_USER_REGISTRATION = True
    AZURE_TENANT_ID = os.environ.get("AZURE_TENANT_ID")
    API_BASE = f"https://login.microsoftonline.com/{AZURE_TENANT_ID}/oauth2"
    ACCESS_TOKEN_URL = f"{API_BASE}/token"
    AUTHORIZE_URL = f"{API_BASE}/authorize"
    AUTH_ROLES_MAPPING = {
                              ***********
    }
    OAUTH_PROVIDERS = [
       {
        "name": "azure",
        "icon": "fa-windows",
        "token_key": "access_token",
        "remote_app": {
            "client_id": os.environ.get("AZURE_CLIENT_ID"),
            "client_secret":  os.environ.get("AZURE_CLIENT_SECRET"),
            "api_base_url": API_BASE,
            "client_kwargs": {
                "resource": os.environ.get("AZURE_CLIENT_ID"),
                "scope": "User.read name preferred_username email profile upn https://graph.windows.net/.default openid"
            },
            "request_token_url": None,
            "access_token_url": ACCESS_TOKEN_URL,
            "authorize_url": AUTHORIZE_URL,
        },
    }
    ]
# Airflow Triggerer Config
triggerer:
  enabled: true

  # Create ServiceAccount
  serviceAccount:
    create: false
    name: airflow

workers:
  serviceAccount:
    create: false
    name: airflow

  resources:
    limits:
      cpu: "1000m"
      memory: "2Gi"
    requests:
      cpu: "400m"
      memory: "1000Mi"

# Flower settings
flower:
  enabled: false

# Statsd settings
statsd:
  enabled: false

# Configuration for the redis provisioned by the chart
redis:
  enabled: false

postgresql:
  enabled: false

# Git sync
dags:
  gitSync:
    enabled: true
    repo: git@github.com:***********
    branch: development/airflow
    subPath: "dags"
    sshKeySecret: "airflow-git"

    knownHosts: |
      github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==
    wait: 120
    resources:
      limits:
        cpu: "100m"
        memory: "400Mi"
      requests:
        cpu: "50m"
        memory: "200Mi"
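
As a possible mitigation while the root cause is investigated, the official chart also exposes liveness-probe settings for the webserver. A minimal sketch of additional values, assuming the `webserver.livenessProbe` key names from the chart (the numbers below are illustrative, not the chart defaults):

```yaml
# Hypothetical probe tuning -- relaxing timeoutSeconds/failureThreshold
# gives a briefly stalled webserver more time before it is restarted.
webserver:
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    timeoutSeconds: 30
    failureThreshold: 20
```

This only masks short stalls; it does not address why the webserver stops responding.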

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 2 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 2 years ago

You need to provide (and possibly analyse) the webserver logs, and likely the Kubernetes events, from around the time the webserver is killed. Ideally you should try to analyse them first and see if you can identify the reason yourself. You should also check how the log files differ from the "normal" situation.

There is no way we can act on this without seeing that information. Converting it into a discussion until more information is available.
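
As a starting point for gathering that information, a rough sketch of the kind of commands involved (the namespace and label selector are assumptions; adjust them to your deployment):

```shell
# Assumed namespace and component label -- substitute your own.
NS=airflow
POD=$(kubectl -n "$NS" get pods -l component=webserver \
  -o jsonpath='{.items[0].metadata.name}')

# Logs from the current container and from the container that was killed
kubectl -n "$NS" logs "$POD" --timestamps > webserver-current.log
kubectl -n "$NS" logs "$POD" --previous --timestamps > webserver-previous.log

# Probe failures, restart counts, and cluster events around the restart
kubectl -n "$NS" describe pod "$POD"
kubectl -n "$NS" get events --sort-by=.lastTimestamp
```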