feat: add `airflow.kubernetesPodTemplate.lifecycle` value

iSWATxJOKERi commented 2 years ago

What issues does your PR fix? Sometimes it is necessary to perform a certain action/do something when a container has started. Other's have also had this problem: #463

In my team's case for example, after we've read secrets from Vault and mounted them in a directory, let's say, usr/local/share/ca-certificates, we want to add these certificates files to the trust store in etc/ssl/certs/ca-certificates.crt by running update-ca-certificates.

The PR allows you to define the container lifecycle hook in the kubernetesPodTemplate, something that was previously only possible by adding in the whole of the pod template via the airflow.kubernetesPodTemplate.stringOverride variable which can be very difficult to do (didn't successfully do it) because it's hard to tell exactly what needs to go into the full pod template of the stringOverride key in order to replicate what you get when you look at the output of a pod spec as-is with -o yaml.

What does your PR do? Adds the airflow.kubernetesPodTemplate.lifecycle value to set container lifecycle hooks on the KubernetesExecutor pod_template. Other's have also had this problem

Checklist

For all Pull Requests

[x] Commits are signed off
[x] Commits have semantic messages
[x] Documentation updated
[x] Passes ct linting

For releasing ONLY

[ ] Chart.yaml version bumped
[ ] CHANGELOG.md updated

thesuperzapper commented 1 year ago

@iSWATxJOKERi Thanks for the PR, however, I am still not sold that there is a legitimate use case that needs us to expose the lifecycle configs to the user. For example, in your ca-certificates use case, a much safer (and already available option) is to use extraInitContainers, and possibly also shareProcessNamespace.

Furthermore, using lifecycle hooks might result in unexpected behavior, specifically:

preStop commands are not guaranteed by Kubernetes to be run before the containers ENTRYPOINT, so there is a possible race condition
the hooks can actually be delivered MORE than once!

While I can sort-of imagine (very niche) use cases for postStart hooks, if we choose to expose their functionality to the user, we must put strong warnings about possible unexpected behavior, and recommend using init-containers in any use case that needs something to run BEFORE the pod starts.

If we were to implement a feature like this, here are my initial thoughts on your PR

You have only added this ability to the airflow.kubernetesPodTemplate pods, we should add values for all the deployments/pods and probably also have a "global" one under an airflow.xxxx value
Technically the correct link for the "spec" of Lifecycle v1 core is https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#lifecycle-v1-core, so that's a better link for the docstring.

I also want to note that we currently use the preStop hook in the workers.celery.gracefullTermination feature for worker pods, so allowing users to set their own lifecycle would not be possible alongside this feature. Given this, I think it's probably best to not expose the full lifecycle spec, but rather create more focused xxxx.postStartScript and xxxx.preStopScript values (that take a string), so that we can do whatever we want in the chart and then run the users extra script (if they have provided one).

We can create these values for all of our deployments, but also make one called airflow.postStartScript and airflow.preStopScript that will set the script on all our airflow deployments/pods. (NOTE: I am not sure if airflow.preStopScript should override workers.preStopScript, or be run in addition to it)

For situations when we have our own commands that we want to run (like in the workers.celery.gracefullTermination feature), we would run whatever our chart-provided command was, and then run the user's one AFTER.

An example of what a workers.preStopScript value might look like:

workers:
   preStopScript: |
     #!/usr/bin/env bash
     echo "Hello world!"

iSWATxJOKERi commented 1 year ago

Hi @thesuperzapper , thanks for the quick response and thorough feedback! Learning alot from the things you mentioned that I didn't even think about!

For example, in your ca-certificates use case, a much safer (and already available option) is to use extraInitContainers

Should've been more detailed in the reason for the MR, but yeah we tried using an initContainer to run the update-ca-certificates command after the vault-agent-init container has mounted the certs into the shared volume mount at /usr/local/share/ca-certificates but that didn't work and we even tried running it inside of a shared volume mount itself thinking that'd also work, but it just runs it in the init container's filesystem not the base containers one. or the other containers that share the mount.

While I can sort-of imagine (very niche) use cases for postStart hooks, if we choose to expose their functionality to the user, we must put strong warnings about possible unexpected behavior

Good point, I agree

I also want to note that we currently use the preStop hook in the workers.celery.gracefullTermination feature for worker pods, so allowing users to set their own lifecycle would not be possible alongside this feature.

Another good point, didn't know lifecycle was used somewhere in the chart...So yeah I think like you said, the two additional scripts would be the best at solving this problem without having to worry about race conditions and gotchas and the sort.

we would run whatever our chart-provided command was, and then run the user's one AFTER.

Do you have an idea where we would add the airflow.postStartScript and airflow.preStopScript and the values for the other deployments/pods in the templates? and how I can help in pushing this along. I'm willing to help push this through! Thanks again for helping us!

thesuperzapper commented 1 year ago

@iSWATxJOKERi I think the values would be like this:

airflow:
  # these are like the other "global" configs like `airflow.extraVolumeMounts` 
  # the question is should the more specific values like `scheduler.postStartScript` override OR run in addition to
  postStartScript: ""
  preStopScript: ""

  kubernetesPodTemplate:
      postStartScript: ""
      preStopScript: ""

  dbMigrations:
      postStartScript: ""
      preStopScript: ""

  sync:
      postStartScript: ""
      preStopScript: ""

scheduler:
  postStartScript: ""
  preStopScript: ""

web:
  postStartScript: ""
  preStopScript: ""

workers:
  postStartScript: ""
  preStopScript: ""

triggerer:
  postStartScript: ""
  preStopScript: ""

flower:
  postStartScript: ""
  preStopScript: ""

We should probably mount these scripts as files from a secret (using projected volume), this could get messy if not done with some kind of template (like how we do pod tolerations), as there are so many different possible scripts.

If you need more guidance on making that template, I can point you in the right direction.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.

Issues never become stale if any of the following is true:

they are added to a Project
they are added to a Milestone
they have the lifecycle/frozen label

airflow-helm / charts