Can't change Helm configuration for db-migrations job

aodj commented 1 year ago

Checks

[X] I have checked for existing issues.
[X] This report is about the User-Community Airflow Helm Chart.

Chart Version

8.6.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.16", GitCommit:"b28e1f370a4a4c428ddbeababcaf0198f048fcac", GitTreeState:"clean", BuildDate:"2022-11-09T13:30:52Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.10", GitCommit:"5c1d2d4295f9b4eb12bfbf6429fdf989f2ca8a02", GitTreeState:"clean", BuildDate:"2023-01-27T22:54:20Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.11.2", GitCommit:"912ebc1cd10d38d340f048efaf0abda047c3468e", GitTreeState:"clean", GoVersion:"go1.20.2"}

Description

It's not possible to change the helm.sh annotations on for the db-migrations job as they are hardcoded in the template:

  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-weight: "-100"
    helm.sh/hook-delete-policy: before-hook-creation

In the environment I'm deploying Airflow to, deployments are handled by Argo which waits for the webserver and scheduler to become "healthy" when deploying, before continuing with the db-migrations job, as defined by the helm.sh/hook value.

As the job is set to run post-install, Argo doesn't run it until the other resources are ready (as explained in the Helm documentation under Hooks and the Release Lifecycle). This means in a new install, the check-for-db-migrations initContainer will never complete, because the db-migrations job can't run, as the webserver and scheduler never become ready.

I thought it would be a simple case of setting the annotations in the values.yaml but I see now that the chart has them hard coded, and any attempt to set these helm.sh annotations results in invalid yaml:

  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-weight: "-100"
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: hook-succeeded,before-hook-creation

This also applies to the other jobs, but they aren't critical for starting the deployments, so will eventually resolve themselves.

Relevant Logs

No response

Custom Helm Values

No response

aodj commented 1 year ago

I should point out that the only way to resolve this is to change the db-migration to not be a job, let it run the migrations, then you can revert it to being a job. This will work until you need to upgrade your Airflow version and it includes new db migrations, and then you need to do the whole thing again.

aodj commented 1 year ago

I did more searching and found #266 and #317 and it looks like there might be something I can set to fix this but I'm not sure I understand what needs to change. Can you advise @thesuperzapper ?

thesuperzapper commented 1 year ago

@aodj the default behavior of the chart is to use a Deployment to handle db-migrations because of this exact issue with ArgoCD (and similar systems).

Are you sure you are not setting the airflow.dbMigrations.runAsJob to true:

https://github.com/airflow-helm/charts/blob/f5c2f0cd617982c642789cf66fdef48453f84e67/charts/airflow/values.yaml#L386-L391

aodj commented 1 year ago

My apologies for not making it clearer, but yes I am running it as a job; I do this because I don't want a long lived Python process consuming 5~600 MB (I also run the other sync steps as Kubernetes jobs for this same reason).

This came up in another thread I wrote last year about the possibility of running these jobs as CronJobs rather than Jobs (#540) because they consume upwards of 2 GB per installation (of which I have several). The Airflow version and the variables/connections/etc get updated rarely so I saw very little benefit to having them constantly up and syncing the same data/db migrations.

As for the problem at hand, the annotations are only set on the Jobs (not the Deployments) so I'm basically looking for a way to customise the ordering of the Helm hooks because otherwise it's either 1) run the db migrations as a deployment forever, or 2) run it as a Job but end up in a chicken-and-egg loop where you can't instantiate the webserver/scheduler until the db migrations are run, which only runs after said webserver/scheduler are up and running.

aodj commented 1 year ago

I've got some code running locally that will merge user provided annotations with the default ones set by the chart, if you're interested in a PR:

{{- define "airflow.db_migrations.annotations" -}}
{{- $defaultAnnotations := (dict "helm.sh/hook" "post-install,post-upgrade" "helm.sh/hook-weight" "-100" "helm.sh/hook-delete-policy" "before-hook-creation") -}}
{{- $annotations := mergeOverwrite $defaultAnnotations .Values.airflow.dbMigrations.annotations -}}
{{- toYaml $annotations -}}
{{- end -}}

If it's something that might be useful I could also generalise it for use across all the different jobs.

thesuperzapper commented 1 year ago

@aodj The issue remains that post-install helm hooks won't run until ALL the main Deployments (like the airflow Scheduler) are Ready when using the --wait flag, which will never happen.

So, as far as I know, there is currently no solution that works in all cases except doing the db-migrations with a Deployment. I have proposed a possible solution in upstream helm, see https://github.com/helm/helm/issues/10555#issuecomment-1515156101.

Interestingly, I think that ArgoCD might already support something similar to my proposed behavior with its own argocd.argoproj.io/hook: Snyc hook. But there is currently no way to invoke via helm-hooks, so would require the inclusion of ArgoCD-specific annotation in the chart.

Also, if you are concerned about the resource usage of the Sync Deployments, I think there is a case to be made for consolidating them all into a single Deployment.

aodj commented 1 year ago

@aodj The issue remains that post-install helm hooks won't run until ALL the main Deployments (like the airflow Scheduler) are Ready when using the --wait flag, which will never happen.

In my installations I've found that the db-migration is the only one that needs to be run to allow the webserver and scheduler to be considered healthy; Airflow will start up without connections, pools, users, and so on apparently without problem.

So, as far as I know, there is currently no solution that works in all cases except doing the db-migrations with a Deployment. I have proposed a possible solution in upstream helm, see helm/helm#10555 (comment).

Interestingly, I think that ArgoCD might already support something similar to my proposed behavior with its own argocd.argoproj.io/hook: Snyc hook. But there is currently no way to invoke via helm-hooks, so would require the inclusion of ArgoCD-specific annotation in the chart.

Also, if you are concerned about the resource usage of the Sync Deployments, I think there is a case to be made for consolidating them all into a single Deployment.

Would you be interested in my suggestion above to allow users of this chart control over the helm hooks? I don't mind the default settings but I would like the ability to override them if I so choose.

stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.

Issues never become stale if any of the following is true:

they are added to a Project
they are added to a Milestone
they have the lifecycle/frozen label

airflow-helm / charts