airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

OSError: [Errno 19] No such device: '/opt/airflow/dags' and '/opt/airflow/logs' #697

Closed stanvv closed 1 year ago

stanvv commented 1 year ago

Checks

Chart Version

8.6.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-08T19:58:30Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"c86d003ea699ec4bcffee10ad563a26b63561c0e", GitTreeState:"clean", BuildDate:"2022-12-17T10:31:53Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.11.0", GitCommit:"472c5736ab01133de504a826bd9ee12cbe4e7904", GitTreeState:"clean", GoVersion:"go1.18.10"}

Description

Recently, several Airflow pods crashed with the error OSError: [Errno 19] No such device: '/opt/airflow/dags' or '/opt/airflow/logs'. In our acceptance environment the worker pods could not find the logs directory and were therefore unable to schedule any job. A full redeployment fixed the issue for a couple of days, but then it occurred again. In production, it was the scheduler and triggerer pods that could not find the dags directory; restarting those pods fixed the issue.

The time frames of all 3 occurrences were different. The setup/configuration of ACC and PRD is similar.
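Errno 19 is ENODEV ("No such device"): the kernel reports that the filesystem backing the path has gone away, which typically points at a stale network mount (e.g. the azurefile SMB share the node lost) rather than a missing directory. A minimal sketch of a check that could be run inside an affected pod to confirm this, assuming the chart's default mount path (the helper name is hypothetical, not part of the chart):

```python
import errno
import os

def volume_is_healthy(path: str) -> bool:
    """Return True if the directory can actually be listed.

    A stale network mount surfaces as OSError with errno 19 (ENODEV),
    exactly as in the traceback below; a merely missing directory would
    raise FileNotFoundError (errno 2, ENOENT) instead.
    """
    try:
        os.listdir(path)
        return True
    except OSError as exc:
        if exc.errno == errno.ENODEV:
            return False
        raise

# Errno 19 is ENODEV in the standard errno table.
assert errno.ENODEV == 19
```

A check like this distinguishes "mount went stale" (remount/reschedule needed) from "volume never attached" (a provisioning problem), which would help narrow down whether the azurefile share or the node is at fault.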

Relevant Logs

Defaulted container "airflow-scheduler" out of: airflow-scheduler, check-db (init), wait-for-db-migrations (init)
Hello from custom entrypoint (baked in docker)
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__init__.py", line 46, in <module>
    settings.initialize()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/settings.py", line 568, in initialize
    import_local_settings()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/settings.py", line 525, in import_local_settings
    import airflow_local_settings
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 963, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 906, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1280, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1252, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1368, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1408, in _fill_cache
OSError: [Errno 19] No such device: '/opt/airflow/dags'
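The traceback shows the failure happens inside importlib's `_fill_cache`, i.e. while Python lists a directory on `sys.path`: `settings.initialize()` puts the DAGs folder on `sys.path` so that `airflow_local_settings` can be found there, and the import scan then calls `os.listdir('/opt/airflow/dags')`, which is where the stale mount raises. A small sketch of that mechanism (the temp directory stands in for the DAGs folder; it is an illustration, not Airflow's actual code):

```python
import os
import sys
import tempfile

# importlib's FileFinder caches directory listings: resolving an import
# scans each sys.path entry with os.listdir(), so a stale mount on the
# path raises OSError(19, 'No such device') before the import succeeds.
dags_dir = tempfile.mkdtemp()
with open(os.path.join(dags_dir, "airflow_local_settings.py"), "w") as f:
    f.write("MARKER = 'loaded'\n")

sys.path.insert(0, dags_dir)
import airflow_local_settings  # resolved via the directory scan

print(airflow_local_settings.MARKER)  # loaded
```

This explains why the crash happens at startup even though no DAG is being parsed yet: merely importing Airflow touches the DAGs volume.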

Custom Helm Values

airflow:
  image:
    repository: our.url/airflow
    tag: "xxxxxx"
    pullPolicy: IfNotPresent
    uid: 50000
    gid: 50000

  executor: KubernetesExecutor
  defaultTolerations:
    - key: "node"
      operator: "Equal"
      value: "compute"
      effect: "NoSchedule"
  config:
    # Security
    AIRFLOW__CORE__SECURE_MODE: "False"
    AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.session"
    AIRFLOW__WEBSERVER__EXPOSE_CONFIG: "False"
    AIRFLOW__WEBSERVER__RBAC: "True"

    # DAGS
    AIRFLOW__CORE__LOAD_EXAMPLES: "False"
    AIRFLOW__CORE__PLUGINS_FOLDER: /opt/airflow/dags/plugins
    AIRFLOW__CORE__PARALLELISM: 64

    # K8s
    AIRFLOW__KUBERNETES__DAGS_IN_IMAGE: "False"
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "True"
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE: "False"
    AIRFLOW__KUBERNETES__NAMESPACE: "airflow"

    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "our_connection_id"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "wasb-airflow-logs"
    AIRFLOW__LOGGING__BASE_LOG_FOLDER: "/opt/airflow/logs/dags"
    # Write logging locally to enable persistency
    AIRFLOW__WEBSERVER__ACCESS_LOGFILE: "/opt/airflow/logs/webserver/access.log"
    AIRFLOW__WEBSERVER__ERROR_LOGFILE: "/opt/airflow/logs/webserver/error.log"
    AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_LOCATION: "/opt/airflow/logs/dags/dag_processor_manager/dag_processor_manager.log"

    AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX: "True"
    AIRFLOW__WEBSERVER__PROXY_FIX_X_PROTO: "1"
    AIRFLOW__WEBSERVER__PROXY_FIX_X_FOR: "1"
    AIRFLOW__WEBSERVER__PROXY_FIX_X_HOST: "1"
    AIRFLOW__WEBSERVER__PROXY_FIX_X_PORT: "1"

    # Scheduler
    AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE: "45"
    AIRFLOW__SCHEDULER__MAX_DAGRUNS_TO_CREATE_PER_LOOP: "15"
    AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"

    ## Disable noisy "Handling signal: ttou" Gunicorn log messages
    GUNICORN_CMD_ARGS: "--log-level WARNING"
    ENVIRONMENT: "dev"

  extraEnv:
    - name: AIRFLOW__CORE__FERNET_KEY
      valueFrom:
        secretKeyRef:
          name: secret
          key: secret
    - name: FLASK_SECRET_KEY
      valueFrom:
        secretKeyRef:
          name: secret
          key: secret    
    - name: AZ_OAUTH_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: secret
          key: secret
    - name: AZ_OAUTH_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: secret
          key: secret

scheduler:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 300
    periodSeconds: 30
    failureThreshold: 5
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: agentpool
            operator: In
            values:
            - servingpool
  # disable log cleanup because logs.persistence.enabled is true
  logCleanup: False
  enabled: False

triggerer:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: agentpool
              operator: In
              values:
                - servingpool

dags:
  persistence:
    enabled: True
    storageClass: azurefile
    accessMode: ReadOnlyMany
    size: 1Gi
logs: 
  persistence:
    enabled: True
    storageClass: azurefile

postgresql:
  enabled: False

redis:   
  enabled: False

flower:
  enabled: False

workers:
  enabled: False
  logCleanup:
    enabled: False

serviceMonitor:
  enabled: true  
  selector:
    prometheus: kube-prometheus
  path: /admin/metrics
  interval: "60s"

web:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: agentpool
              operator: In
              values:
                - servingpool
  webserverConfig:
    stringOverride: |-
      ...
      Config to setup AAD configuration
      ...

pgbouncer:
  enabled: true
  authType: scram-sha-256
  serverSSL:
    mode: verify-ca
  logDisconnections: 1
  logConnections: 1
  verbose: 1
  max_client_conn: 1000
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: agentpool
              operator: In
              values:
                - servingpool
stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label