apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Quickstart Helm Chart fails post-install #16176

Closed kasteph closed 3 years ago

kasteph commented 3 years ago

Apache Airflow version: 2.0.2

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.19

Environment:

What happened:

The Helm chart does not successfully deploy to a kind cluster despite following the Quick Start. I tried multiple times: the flower, postgres, redis and statsd services run fine, but the run-airflow-migrations job fails with a CrashLoopBackOff:

  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  5m19s                  default-scheduler  Successfully assigned airflow/airflow-run-airflow-migrations-c9pph to kind-control-plane
  Normal   Pulled     2m43s (x5 over 5m17s)  kubelet            Container image "apache/airflow:2.0.2" already present on machine
  Normal   Created    2m43s (x5 over 5m17s)  kubelet            Created container run-airflow-migrations
  Normal   Started    2m43s (x5 over 5m17s)  kubelet            Started container run-airflow-migrations
  Warning  BackOff    9s (x18 over 4m25s)    kubelet            Back-off restarting failed container

What you expected to happen:

Successful Helm deployment.

How to reproduce it:

  1. Created a kind cluster: kind create cluster --image kindest/node:v1.18.15
  2. Added Helm chart repo: helm repo add apache-airflow https://airflow.apache.org
  3. Created kube namespace: kubectl create namespace airflow
  4. Installed chart: helm install airflow apache-airflow/airflow --namespace airflow --debug
install.go:173: [debug] Original chart version: ""
install.go:190: [debug] CHART PATH: /Users/stephaniesamson/Library/Caches/helm/repository/airflow-1.0.0.tgz

client.go:282: [debug] Starting delete for "airflow-broker-url" Secret
client.go:122: [debug] creating 1 resource(s)
client.go:282: [debug] Starting delete for "airflow-fernet-key" Secret
client.go:122: [debug] creating 1 resource(s)
client.go:282: [debug] Starting delete for "airflow-redis-password" Secret
client.go:122: [debug] creating 1 resource(s)
client.go:122: [debug] creating 30 resource(s)
client.go:282: [debug] Starting delete for "airflow-run-airflow-migrations" Job
client.go:122: [debug] creating 1 resource(s)
client.go:491: [debug] Watching for changes to Job airflow-run-airflow-migrations with timeout of 5m0s
client.go:519: [debug] Add/Modify event for airflow-run-airflow-migrations: ADDED
client.go:558: [debug] airflow-run-airflow-migrations: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:519: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:558: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: failed post-install: timed out waiting for the condition
helm.go:81: [debug] failed post-install: timed out waiting for the condition
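
When the migration job crash-loops like this, the underlying error is usually in the pod logs rather than in the Helm output. A minimal diagnostic sketch (assuming the release and namespace are both named airflow, as in the Quick Start; pod names carry a random suffix, so the label selector is used):

```shell
# List the migration job's pod(s); the chart labels them component=run-airflow-migrations
kubectl -n airflow get pods -l component=run-airflow-migrations

# Logs of the current attempt, then of the previous (crashed) attempt
kubectl -n airflow logs job/airflow-run-airflow-migrations
kubectl -n airflow logs job/airflow-run-airflow-migrations --previous

# Events often reveal scheduling, image-pull, or PVC problems the logs do not
kubectl -n airflow describe job airflow-run-airflow-migrations
```

The actual failure (database unreachable, PVC unbound, image pull error, OOMKilled, etc.) is almost always visible in one of these three outputs.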
ahoodasf commented 1 year ago

I had the same issue for multiple reasons, so I thought I'd share:

  1. Initially I did not have enough nodes available in my k8s cluster, so the airflow-run-airflow-migrations job pod was not getting scheduled.
  2. After increasing the number of nodes I still had the same issue, because I was using t3.micro (free tier) instances, which do not support some kinds of networking.
noah-gil commented 1 year ago

I was able to resolve this for my single-node testing cluster. Looking at kubectl describe for the postgresql pod, I noticed that it could not bind a persistent volume claim, so postgres never started and every other pod failed to connect to it. The redis and worker pods also failed to bind their persistent volume claims. The solution was to create 3 persistent volumes with sufficient space (10 GB for two of them, 100 GB for the third) and to make sure the storage class for each was set to "" (empty string).
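
For reference, a PersistentVolume along these lines could look like the sketch below (the name, capacity, and hostPath are placeholders; the empty storageClassName matches the comment above, so that the chart's PVCs with no storage class can bind; hostPath is suitable for single-node test clusters only):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-pv-0              # placeholder name
spec:
  capacity:
    storage: 10Gi                 # 100Gi for the third volume, per the comment above
  accessModes:
    - ReadWriteOnce
  storageClassName: ""            # empty string, so unclassed PVCs can bind
  hostPath:
    path: /mnt/data/airflow-pv-0  # placeholder path
```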

KishinNext commented 1 year ago

@noah-gil Do you have a cluster configuration that runs Airflow correctly? I'm using the latest Airflow Helm chart, and I used this configuration for the cluster... but I get the same error :(

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: airflow
  region: us-east-1
  version: "1.23"

managedNodeGroups:
  - name: workers
    instanceType: t3.medium
    privateNetworking: true
    minSize: 1
    maxSize: 3
    desiredCapacity: 3
    volumeSize: 20
    ssh:
      allow: true
      publicKeyName: airflow-workstation
    labels: { role: worker }
    tags:
      nodegroup-role: worker
    iam:
      withAddonPolicies:
        ebs: true
        imageBuilder: true
        efs: true
        albIngress: true
        autoScaler: true
        cloudWatch: true
        externalDNS: true
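
A first check with a config like the one above would be whether any of the chart's PVCs are stuck Pending (a common cause of this timeout, per the earlier comments). A sketch, assuming the airflow namespace; the PVC name is a placeholder:

```shell
# Any PVC stuck in Pending will block postgres/redis/workers from starting
kubectl -n airflow get pvc
kubectl -n airflow describe pvc <pvc-name>   # events explain why binding failed

# On EKS 1.23+, dynamic EBS provisioning requires the aws-ebs-csi-driver addon
# (label may vary by install method):
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
```

The eksctl config above targets Kubernetes 1.23, which is the first EKS version where the in-tree EBS provisioner no longer works without the CSI driver addon, so this is worth ruling out.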
MarianneRay commented 1 year ago

I'm getting the same error. Currently debugging but will open a separate issue if I get to a standstill [ sorry to add on to the comments of this closed issue ]

  - name: Add apache helm chart
    run: |-
      helm repo add apache-airflow https://airflow.apache.org

  - name: Update helm charts
    run: |-
      helm repo update

  - name: Deploy to latest image cluster
    run: |-
      helm upgrade --install $NAMESPACE apache-airflow/airflow \
        --timeout 3m30s  --debug --force --namespace $NAMESPACE --create-namespace \
        --set images.airflow.repository="$ARTIFACT_REGISTRY/gp-ops-controller-$BRANCH_NAME/$REPO_NAME/$BRANCH_NAME/$GITHUB_SHA" \
        --set images.airflow.tag=latest \
        --set images.airflow.pullPolicy=Always \
        --set images.airflow.pullSecretName=registry-credentials \
        --set executor=CeleryExecutor \
        --set pgbouncer.enabled=true \
        --set airflowLocalSettings="" \
        --set secret_key="$AIRFLOW__WEBSERVER__SECRET_KEY" \
        --set logging.remote_logging=true \
        --set logging.remote_base_log_folder="gs://ops-controller-$BRANCH_NAME-bucket/$GITHUB_ENV/dags/logs"

I added the update step after installing the apache-airflow helm chart. My cluster has 3 bound persistent volumes. I increased the machine type of my node pool to e2-highmem-2 and it contains 9 nodes across 3 zones.
I have 18 vCPUs in my cluster and 144GB of memory.

Not sure how to proceed. Any suggestions welcome, thank you!

agarwalYashBCG commented 1 year ago

I tried deleting/uninstalling the Airflow deployment and also wiped the Airflow repo from my local Helm cache. Still facing the same issue; I will post my progress if I'm able to fix it.


client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
upgrade.go:434: [debug] warning: Upgrade "airflow" failed: post-upgrade hooks failed: job failed: BackoffLimitExceeded
Error: UPGRADE FAILED: post-upgrade hooks failed: job failed: BackoffLimitExceeded
helm.go:84: [debug] post-upgrade hooks failed: job failed: BackoffLimitExceeded
UPGRADE FAILED
main.newUpgradeCmd.func2
    helm.sh/helm/v3/cmd/helm/upgrade.go:201
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.5.0/command.go:872
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.5.0/command.go:990
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.5.0/command.go:918
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
    runtime/proc.go:250
runtime.goexit
    runtime/asm_amd64.s:1571

Command: `sudo helm install airflow apache-airflow/airflow --namespace airflow --debug`
Machine: AWS EC2 Ubuntu, running a kind cluster

Helm Version

ubuntu@AMRAPCMU200050L:~/dir$ helm version
version.BuildInfo{Version:"v3.10.1", GitCommit:"9f88ccb6aee40b9a0535fcc7efea6055e1ef72c9", GitTreeState:"clean", GoVersion:"go1.18.7"}
potiuk commented 1 year ago

Without any details on why the migration job failed, I am afraid commenting on a closed issue will not help. You need to look at the logs of the job that failed and post them (ideally in a new issue, as this might be a completely different issue).

agarwalYashBCG commented 1 year ago

Without any details on why the migration job failed, I am afraid commenting on a closed issue will not help. You need to look at the logs of the job that failed and post them (ideally in a new issue, as this might be a completely different issue).

Apologies @potiuk, i'll create a new issue with more detailed instructions to reproduce the issue. Cheers!!

alexlightbody commented 1 year ago

To anyone else who has stumbled upon this thread, for me the issue was Docker Desktop not having enough memory. I increased this to 9gb with a Swap of 2gb and repeated the helm install process and all was fine

lordvcs commented 1 year ago

My issue was fixed when I cleared more of system space

beascar commented 1 year ago

Without any details on why the migration job failed, I am afraid commenting on a closed issue will not help. You need to look at the logs of the job that failed and post them (ideally in a new issue, as this might be a completely different issue).

These are the commands I'm using to check the logs of the failed job:

$ kubectl describe job airflow-run-airflow-migrations
Name:             airflow-run-airflow-migrations
Namespace:        airflow
Selector:         controller-uid=3a6f5bd7-2128-42be-a28d-7a50b215ff3f
Labels:           chart=airflow-1.7.0
                  component=run-airflow-migrations
                  heritage=Helm
                  release=airflow
                  tier=airflow
Annotations:      batch.kubernetes.io/job-tracking: 
                  helm.sh/hook: post-install,post-upgrade
                  helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
                  helm.sh/hook-weight: 1
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Mon, 12 Dec 2022 09:44:39 -0700
Pods Statuses:    1 Active (0 Ready) / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           component=run-airflow-migrations
                    controller-uid=3a6f5bd7-2128-42be-a28d-7a50b215ff3f
                    job-name=airflow-run-airflow-migrations
                    release=airflow
                    tier=airflow
  Service Account:  airflow-migrate-database-job
  Containers:
   run-airflow-migrations:
    Image:      apache/airflow:2.4.1
    Port:       <none>
    Host Port:  <none>
    Args:
      bash
      -c
      exec \
      airflow db upgrade
    Environment:
      PYTHONUNBUFFERED:                     1
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'airflow-fernet-key'>                      Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'>  Optional: false
      AIRFLOW__CELERY__BROKER_URL:          <set to the key 'connection' in secret 'airflow-broker-url'>                      Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
  Volumes:
   config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      airflow-airflow-config
    Optional:  false
Events:
  Type    Reason            Age    From            Message
  ----    ------            ----   ----            -------
  Normal  SuccessfulCreate  7m16s  job-controller  Created pod: airflow-run-airflow-migrations-wtlzd

$ kubectl logs airflow-run-airflow-migrations-wtlzd
Error from server (BadRequest): container "run-airflow-migrations" in pod "airflow-run-airflow-migrations-wtlzd" is waiting to start: trying and failing to pull image

Looks like my issue is similar to the one described above by @Abhinav1598... but I'm unsure which image is the one causing the failure.

potiuk commented 1 year ago

You must check your logs on K8s - as someone who manages a k8s installation, it is absolutely normal for you to diagnose and fix such problems. You will have to learn it, I am afraid, @beascar. Various tools (kubectl, helm, k9s) are useful for that, and your job is basically to master them. You chose k8s as your deployment, so as a consequence you need to understand how to diagnose various problems there.

I cannot diagnose and fix your k8s installation for you, but if you are not yet familiar with kubectl (you should be, eventually), one useful tool is helm install --dry-run - it shows you the resources the Helm chart creates after applying all the templates; just find the right Pod/container and you will see what image it pulls. You can also check this way what resources Helm creates. K9s is also useful for looking at your k8s installation in an "exploratory" way, and it lets you learn how k8s works much faster.

Good luck with the diagnosis.
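
The --dry-run suggestion above can be narrowed to just the image references, which answers the "which image is failing" question directly. A sketch (release and namespace names are assumed; the pod name is the one from the earlier kubectl output and will differ per install):

```shell
# Render the chart locally without installing and list every image it references
helm template airflow apache-airflow/airflow --namespace airflow | grep 'image:' | sort -u

# Or ask the cluster directly which image the failing pod is trying to pull
kubectl -n airflow describe pod airflow-run-airflow-migrations-wtlzd | grep -A 2 'Image'
```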

amorskoy commented 1 year ago

I have observed that, in my case, the postgres pod was in a Pending state due to an unbound persistent volume claim - hope this helps some of you above.


Amin-Siddique commented 1 year ago

On Windows, it worked after increasing the WSL memory limit!! https://learn.microsoft.com/en-us/windows/wsl/wsl-config#configure-global-options-with-wslconfig
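
For reference, the WSL memory limit is set in %UserProfile%\.wslconfig on the Windows side (the values below are examples, not recommendations; run wsl --shutdown afterwards for the change to take effect):

```ini
[wsl2]
memory=8GB
swap=2GB
```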

curlup commented 1 year ago

Hi

Same issue. Helm output

client.go:339: [debug] jobs.batch "airflow-run-airflow-migrations" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job airflow-run-airflow-migrations with timeout of 5m0s
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: ADDED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:310: [debug] Starting delete for "airflow-create-user" Job
client.go:339: [debug] jobs.batch "airflow-create-user" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job airflow-create-user with timeout of 5m0s
client.go:568: [debug] Add/Modify event for airflow-create-user: ADDED
client.go:607: [debug] airflow-create-user: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-create-user: MODIFIED
client.go:607: [debug] airflow-create-user: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-create-user: MODIFIED
client.go:607: [debug] airflow-create-user: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-create-user: MODIFIED
client.go:607: [debug] airflow-create-user: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
upgrade.go:434: [debug] warning: Upgrade "airflow" failed: post-upgrade hooks failed: timed out waiting for the condition
Error: UPGRADE FAILED: post-upgrade hooks failed: timed out waiting for the condition
helm.go:84: [debug] post-upgrade hooks failed: timed out waiting for the condition
UPGRADE FAILED
main.newUpgradeCmd.func2
    helm.sh/helm/v3/cmd/helm/upgrade.go:201
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.5.0/command.go:872
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.5.0/command.go:990
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.5.0/command.go:918
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
    runtime/proc.go:250
runtime.goexit
    runtime/asm_amd64.s:1594

The migration job log shows everything completed, but the job never seems to reach a "success" state?


Container: run-airflow-migrations

/home/airflow/.local/lib/python3.10/site-packages/airflow/models/base.py:49 MovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
DB: postgresql://postgres:***@airflow-pgbouncer.hector-staging:6543/airflow-metadata?sslmode=disable
Performing upgrade with database postgresql://postgres:***@airflow-pgbouncer.hector-staging:6543/airflow-metadata?sslmode=disable
[2023-09-07T17:26:46.733+0000] {migration.py:205} INFO - Context impl PostgresqlImpl.
[2023-09-07T17:26:46.734+0000] {migration.py:208} INFO - Will assume transactional DDL.
[2023-09-07T17:26:46.751+0000] {db.py:1571} INFO - Creating tables
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
Upgrades done
$ kubectl --insecure-skip-tls-verify  get jobs -n hector-staging airflow-run-airflow-migrations
NAME                             COMPLETIONS   DURATION   AGE
airflow-run-airflow-migrations   1/1           14s        74s
$ kubectl --insecure-skip-tls-verify  describe jobs -n hector-staging airflow-run-airflow-migrations
Name:             airflow-run-airflow-migrations
Namespace:        hector-staging
Selector:         controller-uid=11553df0-9e05-42ef-ae5a-78146ff935a7
Labels:           chart=airflow-1.9.0
                  component=run-airflow-migrations
                  heritage=Helm
                  release=airflow
                  tier=airflow
Annotations:      batch.kubernetes.io/job-tracking:
                  helm.sh/hook: post-install,post-upgrade
                  helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
                  helm.sh/hook-weight: 1
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Thu, 07 Sep 2023 13:54:28 -0400
Completed At:     Thu, 07 Sep 2023 13:54:42 -0400
Duration:         14s
Pods Statuses:    0 Active (0 Ready) / 1 Succeeded / 0 Failed
Pod Template:
  Labels:           component=run-airflow-migrations
                    controller-uid=11553df0-9e05-42ef-ae5a-78146ff935a7
                    job-name=airflow-run-airflow-migrations
                    release=airflow
                    tier=airflow
  Service Account:  airflow-migrate-database-job
  Containers:
   run-airflow-migrations:
    Image:     airflow/master:latest
    Port:       <none>
    Host Port:  <none>
    Args:
      bash
      -c
      exec \
      airflow db upgrade
    Environment Variables from:
      airflow-auth-provider  Secret  Optional: false
    Environment:
      PYTHONUNBUFFERED:                     1
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'airflow-fernet-key'>               Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-airflow-metadata'>         Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-airflow-metadata'>         Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-airflow-metadata'>         Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'airflow-webserver-key'>  Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /opt/airflow/config/airflow_local_settings.py from config (ro,path="airflow_local_settings.py")
  Volumes:
   config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      airflow-airflow-config
    Optional:  false
Events:        <none>
potiuk commented 1 year ago

Increase the timeout (see helm's help), or increase memory (check your resource settings). Or, if you use Argo or similar, look at our docs for the chart: https://airflow.apache.org/docs/helm-chart/stable/index.html#installing-the-chart-with-argo-cd-flux-rancher-or-terraform
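
For example, the default 5-minute hook wait can be raised with Helm's --timeout flag (the value below is illustrative):

```shell
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --timeout 15m0s --debug
```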

curlup commented 12 months ago

Thanks @potiuk

Can you by any chance elaborate on why one needs to set

createUserJob:
  useHelmHooks: false
  applyCustomEnv: false
migrateDatabaseJob:
  useHelmHooks: false
  applyCustomEnv: false

for Argo, Rancher, etc.? As in: why will the migrations not run without this (or with this? I'm confused now)?

potiuk commented 12 months ago

I think that is a question for Argo and Rancher.

The current way works with standard Helm - they seem to use the hooks in a non-standard way, but maybe you can help develop better approaches. We are an open-source project, so we aim to support standards, not commercial solutions that have somewhat modified them. But if you use such a solution and want to help make it better supported - cool.

Some of the initial reasoning was described here https://github.com/apache/airflow/issues/17447, but if someone (you?) finds a better way of supporting Argo/Rancher, that's cool. We are happy to accept contributions to make it easier/better. I personally don't use Argo, so I cannot comment further other than: this is the way someone at some point found to be a working solution. But if someone else finds a better way and can confirm it works (while keeping it working for the regular Helm chart), that is even cooler.

Airflow is created by more than 2,600 contributors - often, people who miss something or find it confusing spend the time to fix it and contribute back. So if you think you can help with analysing this and providing a better fix - cool.