airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

CrashLoopBackOff for most pods when enabling logs persistence with EFS #659

Closed · dylac closed this 1 year ago

dylac commented 1 year ago

Chart Version

8.6.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.6-eks-7d68063", GitCommit:"f24e667e49fb137336f7b064dba897beed639bad", GitTreeState:"clean", BuildDate:"2022-02-23T19:32:14Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.13-eks-15b7512", GitCommit:"94138dfbea757d7aaf3b205419578ef186dd5efb", GitTreeState:"clean", BuildDate:"2022-08-31T19:15:48Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.17.5"}

Description

I am trying to configure an EFS share as the destination for persistent logging. When I run helm upgrade --install, the PVC shows as "Bound" to the correct PV and storageClass from the airflow pods' perspective, and I've verified that I can mount and write to the share via extraVolumes using a sample app from the AWS documentation, so I'm fairly sure this isn't an AWS/EKS/non-Helm issue at this point. The relevant part of my values.yaml overrides is in the appropriate section below.
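
For context, a minimal sketch of the static-provisioning setup behind efs-claim, following the AWS EFS CSI driver examples (the PV name efs-pv is illustrative and the fileSystemId is a placeholder):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv                  # illustrative name
spec:
  capacity:
    storage: 5Gi                # EFS ignores capacity, but the field is required
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxxxxx   # placeholder EFS filesystem ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi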

I've tried many different combinations of the relevant options; none of them has helped.

All containers except redis and pgbouncer (I'm using an external database on RDS) fail almost immediately with "Back-off restarting failed container" and CrashLoopBackOff, which means there isn't much in the way of logs. Notably, though, the efs-csi-controller containers don't show any attempt by the pods to mount the EFS share (they did when I used the sample app from the AWS docs).

If I just set logs.persistence.enabled: false, everything works fine: logs are written to the path, just not to EFS.
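
In other words, a sketch of the values that do work, with the chart-managed persistence simply switched off:

logs:
  path: /airflow-logs

  persistence:
    enabled: false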

Feeling a bit stuck. Any advice is appreciated!

Relevant Logs

➜  kubernetes-yaml git:(master) ✗ kubectl describe pod airflow-web-58557b7954-lj22q
Name:         airflow-web-58557b7954-lj22q
Namespace:    default
Priority:     0
Node:         ip-172-22-5-115.ec2.internal/172.22.5.115
Start Time:   Tue, 18 Oct 2022 14:45:46 -0400
Labels:       app=airflow
              component=web
              pod-template-hash=58557b7954
              release=airflow
Annotations:  checksum/config-webserver-config: a49aae09f4883ea186d758be5b13b196644b4c885635dbf1f210cb183a3011e0
              checksum/secret-config-envs: aa5eedf54d7e780db73e882991eefcf1d4812f419d418ed6d616bfc29e490205
              checksum/secret-local-settings: e3b0c44298fc1c149afbf4c8996fb92427ae41e46419934ca495991b7852b855
              cluster-autoscaler.kubernetes.io/safe-to-evict: true
              kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           172.25.203.69
IPs:
  IP:           172.25.203.69
Controlled By:  ReplicaSet/airflow-web-58557b7954
Init Containers:
  dags-git-clone:
    Container ID:   docker://b7757e23e4fd512324e60851e57d036bac27571dfde1fd18fd034c7289df0635
    Image:          k8s.gcr.io/git-sync/git-sync:v3.5.0
    Image ID:       docker-pullable://k8s.gcr.io/git-sync/git-sync@sha256:d16f5b2bca94cdbb4e40b256bfe639450a6f0577dbd8b3fcaf126a2261822fcd
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 18 Oct 2022 14:45:57 -0400
      Finished:     Tue, 18 Oct 2022 14:45:58 -0400
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      GIT_SYNC_ONE_TIME:           true
      GIT_SYNC_ROOT:               /dags
      GIT_SYNC_DEST:               repo
      GIT_SYNC_REPO:               https://github.research.chop.edu/analytics/data-pipeline-dags.git
      GIT_SYNC_BRANCH:             uat
      GIT_SYNC_REV:                HEAD
      GIT_SYNC_DEPTH:              1
      GIT_SYNC_WAIT:               60
      GIT_SYNC_TIMEOUT:            120
      GIT_SYNC_ADD_USER:           true
      GIT_SYNC_MAX_SYNC_FAILURES:  3
      GIT_KNOWN_HOSTS:             false
      REDIS_PASSWORD:              <set to the key 'redis-password' in secret 'airflow-redis'>  Optional: false
      CONNECTION_CHECK_MAX_COUNT:  0
    Mounts:
      /dags from dags-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hk9dc (ro)
  check-db:
    Container ID:  docker://ff95a6de4d2ce0b3f1589779fcd282ed7a911ad029597762e4298e163c84eb7c
    Image:         123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest
    Image ID:      docker-pullable://123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image@sha256:35caacc30d78285aea466ca2ad3f54119bbdf41efee15163e0c853b44718f17fb
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/dumb-init
      --
      /entrypoint
    Args:
      bash
      -c
      exec timeout 60s airflow db check
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 18 Oct 2022 15:02:15 -0400
      Finished:     Tue, 18 Oct 2022 15:02:18 -0400
    Ready:          False
    Restart Count:  8
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      REDIS_PASSWORD:              <set to the key 'redis-password' in secret 'airflow-redis'>  Optional: false
      CONNECTION_CHECK_MAX_COUNT:  0
    Mounts:
      /airflow-logs from logs-data (rw,path="airflow-logs")
      /opt/airflow/dags from dags-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hk9dc (ro)
  wait-for-db-migrations:
    Container ID:
    Image:         123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/dumb-init
      --
      /entrypoint
    Args:
      bash
      -c
      exec airflow db check-migrations -t 60
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      REDIS_PASSWORD:              <set to the key 'redis-password' in secret 'airflow-redis'>  Optional: false
      CONNECTION_CHECK_MAX_COUNT:  0
    Mounts:
      /airflow-logs from logs-data (rw,path="airflow-logs")
      /opt/airflow/dags from dags-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hk9dc (ro)
Containers:
  airflow-web:
    Container ID:
    Image:         123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/dumb-init
      --
      /entrypoint
    Args:
      bash
      -c
      exec airflow webserver
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:web/health delay=10s timeout=30s period=10s #success=1 #failure=6
    Readiness:      http-get http://:web/health delay=10s timeout=30s period=10s #success=1 #failure=6
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      REDIS_PASSWORD:              <set to the key 'redis-password' in secret 'airflow-redis'>  Optional: false
      CONNECTION_CHECK_MAX_COUNT:  0
    Mounts:
      /airflow-logs from logs-data (rw,path="airflow-logs")
      /opt/airflow/dags from dags-data (rw)
      /opt/airflow/webserver_config.py from webserver-config (ro,path="webserver_config.py")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hk9dc (ro)
  dags-git-sync:
    Container ID:
    Image:          k8s.gcr.io/git-sync/git-sync:v3.5.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      GIT_SYNC_ROOT:               /dags
      GIT_SYNC_DEST:               repo
      GIT_SYNC_REPO:               https://github.research.chop.edu/analytics/data-pipeline-dags.git
      GIT_SYNC_BRANCH:             uat
      GIT_SYNC_REV:                HEAD
      GIT_SYNC_DEPTH:              1
      GIT_SYNC_WAIT:               60
      GIT_SYNC_TIMEOUT:            120
      GIT_SYNC_ADD_USER:           true
      GIT_SYNC_MAX_SYNC_FAILURES:  3
      GIT_KNOWN_HOSTS:             false
      REDIS_PASSWORD:              <set to the key 'redis-password' in secret 'airflow-redis'>  Optional: false
      CONNECTION_CHECK_MAX_COUNT:  0
    Mounts:
      /dags from dags-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hk9dc (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  dags-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  logs-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  efs-claim
    ReadOnly:   false
  webserver-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  airflow-webserver-config
    Optional:    false
  kube-api-access-hk9dc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  19m                   default-scheduler  Successfully assigned default/airflow-web-58557b7954-lj22q to ip-172-22-5-115.ec2.internal
  Normal   Pulled     19m                   kubelet            Container image "k8s.gcr.io/git-sync/git-sync:v3.5.0" already present on machine
  Normal   Created    19m                   kubelet            Created container dags-git-clone
  Normal   Started    19m                   kubelet            Started container dags-git-clone
  Normal   Pulled     19m                   kubelet            Successfully pulled image "123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest" in 182.361684ms
  Normal   Pulled     19m                   kubelet            Successfully pulled image "123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest" in 127.129902ms
  Normal   Pulled     19m                   kubelet            Successfully pulled image "123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest" in 131.668638ms
  Normal   Pulling    18m (x4 over 19m)     kubelet            Pulling image "123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest"
  Normal   Created    18m (x4 over 19m)     kubelet            Created container check-db
  Normal   Started    18m (x4 over 19m)     kubelet            Started container check-db
  Normal   Pulled     18m                   kubelet            Successfully pulled image "123412341234.dkr.ecr.us-east-1.amazonaws.com/airflow-image:latest" in 128.687795ms
  Warning  BackOff    4m38s (x67 over 19m)  kubelet            Back-off restarting failed container

➜  ~ kubectl logs deployment/efs-csi-controller -n kube-system -c csi-provisioner
Found 2 pods, using pod/efs-csi-controller-5dbff995c9-rdxpr
W1018 11:59:08.701265       1 feature_gate.go:235] Setting GA feature gate Topology=true. It will be removed in a future release.
I1018 11:59:08.701362       1 feature_gate.go:243] feature gates: &{map[Topology:true]}
I1018 11:59:08.701402       1 csi-provisioner.go:132] Version: v2.1.1-0-g353098c90
I1018 11:59:08.701426       1 csi-provisioner.go:155] Building kube configs for running in cluster...
I1018 11:59:08.737436       1 connection.go:153] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
I1018 11:59:08.741102       1 common.go:111] Probing CSI driver for readiness
I1018 11:59:08.742267       1 csi-provisioner.go:202] Detected CSI driver efs.csi.aws.com
I1018 11:59:08.745937       1 csi-provisioner.go:244] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I1018 11:59:08.748944       1 controller.go:756] Using saving PVs to API server in background
I1018 11:59:08.756100       1 leaderelection.go:243] attempting to acquire leader lease kube-system/efs-csi-aws-com...

^ nothing here beyond startup and leader election; no provisioning or mount activity from the airflow pods.
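
For completeness, node-side mount attempts would show up in the driver's DaemonSet rather than the controller; assuming the driver was installed from the stock AWS manifests (DaemonSet efs-csi-node, main container efs-plugin), those logs can be pulled with:

kubectl logs daemonset/efs-csi-node -n kube-system -c efs-plugin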

Custom Helm Values


airflow:
  ## mount the EFS-backed claim directly into the airflow pods
  ## (note: this is the same claim and path that logs.persistence uses below)
  extraVolumeMounts:
    - name: airflow-efs
      mountPath: /airflow-logs

  extraVolumes:
    - name: airflow-efs
      persistentVolumeClaim:
        claimName: efs-claim

logs:
  ## write airflow logs under the EFS mount path
  path: /airflow-logs

  persistence:
    enabled: true

    ## use the pre-created EFS claim instead of letting the chart provision one
    existingClaim: "efs-claim"
    subPath: "airflow-logs"
    storageClass: "efs-sc"
    accessMode: ReadWriteMany
    size: 5Gi
thesuperzapper commented 1 year ago

@dylac it's going to be difficult to debug this without some logs. Did you know that you can get the logs of failed containers using the --previous flag, i.e. kubectl logs pod/xxxxxxx --previous?
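
For example, with the pod from your describe output above (check-db is the init container that is crash-looping):

kubectl logs pod/airflow-web-58557b7954-lj22q -c check-db --previous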

Alternatively, I always recommend the k9s CLI for managing Kubernetes, which exposes pod logs with the simple press of the L key, and the previous logs with the press of the P key.

Can you try and get the logs of the crashing pod for me?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label