kasteph closed this issue 3 years ago.
I had the same issue; there were multiple possible causes, so I thought I'd share.
I was able to resolve this for my single-node testing cluster. Checking kubectl describe for the postgresql pod, I noticed that it could not bind a persistent volume claim, which caused every pod to fail to connect to Postgres (since Postgres could not start). Other pods failed to bind their persistent volume claims too, namely the redis and worker pods. The solution was to create 3 persistent volumes with sufficient space (10 GB for 2 of them, 100 GB for the 3rd) and to make sure the storage class for each was set to "" (empty string).
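To illustrate the fix described above, here is a minimal sketch of such a PersistentVolume. The names, hostPath, and capacity are illustrative, not taken from the actual cluster, and hostPath is only suitable for single-node test setups:

```yaml
# Hypothetical PV for a single-node test cluster; repeat per claim with the
# capacities mentioned above (10Gi for two of them, 100Gi for the third).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-postgres-pv        # illustrative name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: ""             # must match the empty storageClassName on the PVC
  hostPath:
    path: /mnt/data/airflow-postgres   # hostPath is for local testing only
```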
@noah-gil Do you have the configuration for a cluster that runs Airflow correctly? I'm using the latest Helm chart of Airflow, and I used this configuration for the cluster... but I get the same error :(
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: airflow
  region: us-east-1
  version: "1.23"
managedNodeGroups:
  - name: workers
    instanceType: t3.medium
    privateNetworking: true
    minSize: 1
    maxSize: 3
    desiredCapacity: 3
    volumeSize: 20
    ssh:
      allow: true
      publicKeyName: airflow-workstation
    labels: { role: worker }
    tags:
      nodegroup-role: worker
    iam:
      withAddonPolicies:
        ebs: true
        imageBuilder: true
        efs: true
        albIngress: true
        autoScaler: true
        cloudWatch: true
        externalDNS: true
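One thing worth checking with a config like this (a hedged aside, since the config targets Kubernetes 1.23): on EKS 1.23+ the in-tree EBS provisioner no longer works, so PVCs stay unbound and pods stay Pending unless the EBS CSI driver addon is installed. A quick sketch, with the cluster and region names taken from the eksctl config above:

```shell
# See whether any PVCs are stuck unbound (a common cause of Pending pods)
kubectl get pvc --all-namespaces

# If they are, install the EBS CSI driver addon for the cluster
eksctl create addon --name aws-ebs-csi-driver --cluster airflow --region us-east-1
```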
I'm getting the same error. Currently debugging but will open a separate issue if I get to a standstill [ sorry to add on to the comments of this closed issue ]
- name: Add apache helm chart
  run: |-
    helm repo add apache-airflow https://airflow.apache.org
- name: Update helm charts
  run: |-
    helm repo update
- name: Deploy to latest image cluster
  run: |-
    helm upgrade --install $NAMESPACE apache-airflow/airflow \
      --timeout 3m30s --debug --force --namespace $NAMESPACE --create-namespace \
      --set images.airflow.repository="$ARTIFACT_REGISTRY/gp-ops-controller-$BRANCH_NAME/$REPO_NAME/$BRANCH_NAME/$GITHUB_SHA" \
      --set images.airflow.tag=latest \
      --set images.airflow.pullPolicy=Always \
      --set images.airflow.pullSecretName=registry-credentials \
      --set executor=CeleryExecutor \
      --set pgbouncer.enabled=true \
      --set airflowLocalSettings="" \
      --set secret_key="$AIRFLOW__WEBSERVER__SECRET_KEY" \
      --set logging.remote_logging=true \
      --set logging.remote_base_log_folder="gs://ops-controller-$BRANCH_NAME-bucket/$GITHUB_ENV/dags/logs"
I added the update step after installing the apache-airflow helm chart.
My cluster has 3 bound persistent volumes.
I increased the machine type of my node pool to e2-highmem-2 and it contains 9 nodes across 3 zones.
I have 18 vCPUs in my cluster and 144GB of memory.
Not sure how to proceed. Any suggestions welcome, thank you!
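In case it helps anyone debugging a failed post-upgrade hook, a sketch of commands to pull the failing job's logs (the job name here is assumed to be the chart's default, airflow-run-airflow-migrations):

```shell
# List the pods created by the migration job (Kubernetes adds the
# job-name label to job pods automatically)
kubectl get pods -n "$NAMESPACE" -l job-name=airflow-run-airflow-migrations

# Show the job's status and events
kubectl describe job -n "$NAMESPACE" airflow-run-airflow-migrations

# Read the pod's logs; --previous shows the prior attempt if it restarted
kubectl logs -n "$NAMESPACE" -l job-name=airflow-run-airflow-migrations --previous
```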
I tried deleting/uninstalling the Airflow deployment and also wiped the airflow repo clean from local Helm. Still facing the same issue. Will post my progress in case I'm able to fix it.
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
upgrade.go:434: [debug] warning: Upgrade "airflow" failed: post-upgrade hooks failed: job failed: BackoffLimitExceeded
Error: UPGRADE FAILED: post-upgrade hooks failed: job failed: BackoffLimitExceeded
helm.go:84: [debug] post-upgrade hooks failed: job failed: BackoffLimitExceeded
UPGRADE FAILED
main.newUpgradeCmd.func2
helm.sh/helm/v3/cmd/helm/upgrade.go:201
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra@v1.5.0/command.go:872
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra@v1.5.0/command.go:990
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra@v1.5.0/command.go:918
main.main
helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
runtime/proc.go:250
runtime.goexit
runtime/asm_amd64.s:1571
Command: `sudo helm install airflow apache-airflow/airflow --namespace airflow --debug`
Machine: AWS EC2 Ubuntu, running a kind cluster
Helm Version
ubuntu@AMRAPCMU200050L:~/dir$ helm version
version.BuildInfo{Version:"v3.10.1", GitCommit:"9f88ccb6aee40b9a0535fcc7efea6055e1ef72c9", GitTreeState:"clean", GoVersion:"go1.18.7"}
Without any details on why the migration job failed, I am afraid commenting on a closed issue will not help. You need to see the logs of the job that failed and post them (ideally as a new issue, as this might be a completely different issue).
Apologies @potiuk, I'll create a new issue with more detailed instructions to reproduce the issue. Cheers!!
To anyone else who has stumbled upon this thread: for me the issue was Docker Desktop not having enough memory. I increased this to 9 GB with a swap of 2 GB, repeated the helm install process, and all was fine.
My issue was fixed when I freed up more system disk space.
Without any details on why the migration job failed, I am afraid commenting on a closed issue will not help. You need to see the logs of the job that failed and post them (ideally as a new issue, as this might be a completely different issue).
These are the commands I'm using to check the logs of the failed job:
$ kubectl describe job airflow-run-airflow-migrations
Name: airflow-run-airflow-migrations
Namespace: airflow
Selector: controller-uid=3a6f5bd7-2128-42be-a28d-7a50b215ff3f
Labels: chart=airflow-1.7.0
component=run-airflow-migrations
heritage=Helm
release=airflow
tier=airflow
Annotations: batch.kubernetes.io/job-tracking:
helm.sh/hook: post-install,post-upgrade
helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
helm.sh/hook-weight: 1
Parallelism: 1
Completions: 1
Completion Mode: NonIndexed
Start Time: Mon, 12 Dec 2022 09:44:39 -0700
Pods Statuses: 1 Active (0 Ready) / 0 Succeeded / 0 Failed
Pod Template:
Labels: component=run-airflow-migrations
controller-uid=3a6f5bd7-2128-42be-a28d-7a50b215ff3f
job-name=airflow-run-airflow-migrations
release=airflow
tier=airflow
Service Account: airflow-migrate-database-job
Containers:
run-airflow-migrations:
Image: apache/airflow:2.4.1
Port: <none>
Host Port: <none>
Args:
bash
-c
exec \
airflow db upgrade
Environment:
PYTHONUNBUFFERED: 1
AIRFLOW__CORE__FERNET_KEY: <set to the key 'fernet-key' in secret 'airflow-fernet-key'> Optional: false
AIRFLOW__CORE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW_CONN_AIRFLOW_DB: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__WEBSERVER__SECRET_KEY: <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'> Optional: false
AIRFLOW__CELERY__BROKER_URL: <set to the key 'connection' in secret 'airflow-broker-url'> Optional: false
Mounts:
/opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: airflow-airflow-config
Optional: false
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 7m16s job-controller Created pod: airflow-run-airflow-migrations-wtlzd
$ kubectl logs airflow-run-airflow-migrations-wtlzd
Error from server (BadRequest): container "run-airflow-migrations" in pod "airflow-run-airflow-migrations-wtlzd" is waiting to start: trying and failing to pull image
Looks like my issue is similar to the one described above by @Abhinav1598... but I'm unsure which image is causing the failure.
You must check your logs on K8S - it is absolutely normal for you, as someone who manages a k8s installation, to fix any problems and be able to diagnose them. You have to learn it, I am afraid, @beascar. Various tools (kubectl, helm, k9s) are useful for that, and your job is basically to master them. You chose k8s as your deployment, so as a consequence you need to understand how to diagnose various problems there.
I cannot solve and diagnose your k8s installation for you, but if you are not familiar with using kubectl (you should be eventually), one useful tool is helm install --dry-run - it will show you the resources the Helm chart creates after applying all the templates; just find the right Pod/container and you will see what image it pulls. You can also check this way which resources Helm creates. K9s is also useful for looking at your k8s installation in an "exploratory" way - and it lets you learn how k8s works much faster.
Good luck with the diagnosis.
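To illustrate the dry-run approach described above (the release and namespace names here are just examples):

```shell
# Render all chart templates locally without touching the cluster,
# then list every container image the release would pull
helm template airflow apache-airflow/airflow --namespace airflow \
  | grep 'image:' | sort -u

# Or reuse your exact upgrade command with --dry-run --debug added
helm upgrade --install airflow apache-airflow/airflow --namespace airflow \
  --dry-run --debug | grep 'image:'
```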
I have observed that, in my case, the postgres pod was in Pending state due to an unbound claim - hope it helps some of you above.
For Windows: after increasing WSL memory, it worked!! https://learn.microsoft.com/en-us/windows/wsl/wsl-config#configure-global-options-with-wslconfig
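For reference, the relevant .wslconfig settings (values are examples, not recommendations; the file lives at %UserProfile%\.wslconfig, and running `wsl --shutdown` afterwards is needed for the change to take effect):

```
[wsl2]
memory=8GB   # example limit; pick what your machine can spare
swap=2GB
```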
Hi, same issue. Helm output:
client.go:339: [debug] jobs.batch "airflow-run-airflow-migrations" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job airflow-run-airflow-migrations with timeout of 5m0s
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: ADDED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:607: [debug] airflow-run-airflow-migrations: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:310: [debug] Starting delete for "airflow-create-user" Job
client.go:339: [debug] jobs.batch "airflow-create-user" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job airflow-create-user with timeout of 5m0s
client.go:568: [debug] Add/Modify event for airflow-create-user: ADDED
client.go:607: [debug] airflow-create-user: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-create-user: MODIFIED
client.go:607: [debug] airflow-create-user: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-create-user: MODIFIED
client.go:607: [debug] airflow-create-user: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for airflow-create-user: MODIFIED
client.go:607: [debug] airflow-create-user: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
upgrade.go:434: [debug] warning: Upgrade "airflow" failed: post-upgrade hooks failed: timed out waiting for the condition
Error: UPGRADE FAILED: post-upgrade hooks failed: timed out waiting for the condition
helm.go:84: [debug] post-upgrade hooks failed: timed out waiting for the condition
UPGRADE FAILED
main.newUpgradeCmd.func2
helm.sh/helm/v3/cmd/helm/upgrade.go:201
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra@v1.5.0/command.go:872
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra@v1.5.0/command.go:990
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra@v1.5.0/command.go:918
main.main
helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
runtime/proc.go:250
runtime.goexit
runtime/asm_amd64.s:1594
The migration job log shows everything is done, but the job never reaches a "success" state?
Container: run-airflow-migrations
Filter
Disconnected
/home/airflow/.local/lib/python3.10/site-packages/airflow/models/base.py:49 MovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings. Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
DB: postgresql://postgres:***@airflow-pgbouncer.hector-staging:6543/airflow-metadata?sslmode=disable
Performing upgrade with database postgresql://postgres:***@airflow-pgbouncer.hector-staging:6543/airflow-metadata?sslmode=disable
[2023-09-07T17:26:46.733+0000] {migration.py:205} INFO - Context impl PostgresqlImpl.
[2023-09-07T17:26:46.734+0000] {migration.py:208} INFO - Will assume transactional DDL.
[2023-09-07T17:26:46.751+0000] {db.py:1571} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
Upgrades done
$ kubectl --insecure-skip-tls-verify get jobs -n hector-staging airflow-run-airflow-migrations
NAME COMPLETIONS DURATION AGE
airflow-run-airflow-migrations 1/1 14s 74s
$ kubectl --insecure-skip-tls-verify describe jobs -n hector-staging airflow-run-airflow-migrations
Name: airflow-run-airflow-migrations
Namespace: hector-staging
Selector: controller-uid=11553df0-9e05-42ef-ae5a-78146ff935a7
Labels: chart=airflow-1.9.0
component=run-airflow-migrations
heritage=Helm
release=airflow
tier=airflow
Annotations: batch.kubernetes.io/job-tracking:
helm.sh/hook: post-install,post-upgrade
helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
helm.sh/hook-weight: 1
Parallelism: 1
Completions: 1
Completion Mode: NonIndexed
Start Time: Thu, 07 Sep 2023 13:54:28 -0400
Completed At: Thu, 07 Sep 2023 13:54:42 -0400
Duration: 14s
Pods Statuses: 0 Active (0 Ready) / 1 Succeeded / 0 Failed
Pod Template:
Labels: component=run-airflow-migrations
controller-uid=11553df0-9e05-42ef-ae5a-78146ff935a7
job-name=airflow-run-airflow-migrations
release=airflow
tier=airflow
Service Account: airflow-migrate-database-job
Containers:
run-airflow-migrations:
Image: airflow/master:latest
Port: <none>
Host Port: <none>
Args:
bash
-c
exec \
airflow db upgrade
Environment Variables from:
airflow-auth-provider Secret Optional: false
Environment:
PYTHONUNBUFFERED: 1
AIRFLOW__CORE__FERNET_KEY: <set to the key 'fernet-key' in secret 'airflow-fernet-key'> Optional: false
AIRFLOW__CORE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW_CONN_AIRFLOW_DB: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__WEBSERVER__SECRET_KEY: <set to the key 'webserver-secret-key' in secret 'airflow-webserver-key'> Optional: false
Mounts:
/opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
/opt/airflow/config/airflow_local_settings.py from config (ro,path="airflow_local_settings.py")
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: airflow-airflow-config
Optional: false
Events: <none>
Increase the timeout (look at helm's help), or increase memory (check your resources settings). Or, if you use Argo or similar, look at our docs for the chart: https://airflow.apache.org/docs/helm-chart/stable/index.html#installing-the-chart-with-argo-cd-flux-rancher-or-terraform
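For example, the timeout suggestion might look like this (the 15m value is arbitrary; Helm's default wait is 5m, which matches the 5m0s seen in the debug output above):

```shell
# Give the post-install/post-upgrade hook jobs more time before Helm
# declares the upgrade failed
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow --timeout 15m0s --debug
```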
Thanks @potiuk
Can you by any chance elaborate on why one needs to set
createUserJob:
  useHelmHooks: false
  applyCustomEnv: false
migrateDatabaseJob:
  useHelmHooks: false
  applyCustomEnv: false
for Argo, Rancher, etc.? As in: why, without this (or with this? I'm confused now), will the migrations not be run?
I think it's the question to Argo and Rancher.
The current way works with standard Helm - those tools seem to use the hooks in a non-standard way, but maybe you can help develop better ways. We are an open-source project, so we aim to support standards, not commercial solutions that have somewhat modified them. But if you use such a solution and want to help make it better supported - cool.
Some of the initial reasoning was described here: https://github.com/apache/airflow/issues/17447 - but if someone (you?) finds a better way of supporting Argo/Rancher, that's cool. We are happy to accept contributions to make it easier/better. I personally don't use Argo, so I cannot comment further, other than to say this is the way someone at some point found to be a working solution. But if someone else finds a better way and can confirm it works (while keeping it working for the regular Helm Chart) - that is even cooler.
Airflow is created by > 2600 contributors - and often people who miss something or find it confusing spend time to fix it and contribute back. So - if you think you can help with analysing this and providing a better fix - cool.
Apache Airflow version: 2.0.2
Kubernetes version (use kubectl version): 1.19
Environment: kind locally
uname -a: Darwin MacBook-Pro 19.6.0 Darwin Kernel Version 19.6.0: Mon Apr 12 20:57:45 PDT 2021; root:xnu-6153.141.28.1~1/RELEASE_X86_64 x86_64

What happened:
The Helm chart does not successfully deploy to a kind cluster despite following the Quick Start. Tried repeatedly; the flower, postgres, redis and statsd services run fine, but it fails at the run-airflow-migrations service with a CrashLoopBackoff.

What you expected to happen:
Successful Helm deployment.
How to reproduce it:
kind create cluster --image kindest/node:v1.18.15
helm repo add apache-airflow https://airflow.apache.org
kubectl create namespace airflow
helm install airflow apache-airflow/airflow --namespace airflow --debug