airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0
630 stars 474 forks source link

fix: liveness probes airflow 2.6.0 support #738

Closed davidspek closed 1 year ago

davidspek commented 1 year ago

What issues does your PR fix?

What does your PR do?

This PR updates the liveness probes for the scheduler and triggerer so they are compatible with Airflow 2.6.0. Not sure how best to deal with backwards compatibility though so this PR still requires some work. I assume there are some ways backwards compatibility with airflow versions that already exist, so if those can be recommended I can update this PR.

Checklist

For all Pull Requests

For releasing ONLY

wikitops commented 1 year ago

Is there a plan to merge this pull request?

thesuperzapper commented 1 year ago

@DavidSpek @wikitops I agree we need to get this fixed, but I am not sure explicitly checking the version of airflow in the code is the best way.

Perhaps a simple approach might be just to wrap try/except around the import, and possibly exit with NO error if we can't import either one (so if it changes again in the future, the probes don't completely break the user's deployment).

davidspek commented 1 year ago

Is there a reason why the /health probing endpoint isn't being used? It would also be possible to inspect the JSON that is returned from that endpoint for more advanced probing. I agree that the implementation is a bit clunky and I don't like checking the version in the code, I just needed a quick fix initially.

thesuperzapper commented 1 year ago

@DavidSpek regarding the /health probe, what could we additionally do with it?

davidspek commented 1 year ago

@thesuperzapper The health endpoint seems to be doing most of the same things as the current probe, but exposes it as an http endpoint on the component. https://github.com/apache/airflow/blob/86d62d3c0398d0d842c9368c5832de92bc6ec4c5/airflow/utils/scheduler_health.py#L31 is an example for this, and it's the code I used to base this PR around. I don't think that endpoint covers the full capabilities of the current probe though, like task creation. I'm also not sure when the endpoint was introduced so there might be an issue with backwards compatibility to certain Airflow versions of the chart were to start using it. However, since using the endpoint would solve forwards compatibility it could be worth considering.

Some more information about the probes can also be found at https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html.

You're most familiar with the codebase so I'll go off your judgement what would be the best route forward and can look at implementing that.

chattarajoy commented 1 year ago

I am using the same /health endpoint within my org, thought of implementing the same for upstream https://github.com/airflow-helm/charts/pull/742

PTAL

potiuk commented 1 year ago

This is a wrong fix. Look at the airlfow -community managed chart. We've changed long time ago the liveness probes to depart from Hard-coded ways of checking it and runnning airflow jobs check command. You should do the same here.

The details (including the data models and code used to run the health checl) should not be present in the health check. They can and will change as new versions of airflow will be released and changing it in this way makes completely no sense, because it will likely break for 2.7 and beyond.

See this for an example how it should be done: https://github.com/apache/airflow/blob/main/chart/templates/_helpers.yaml#L649

thesuperzapper commented 1 year ago

@potiuk @chattarajoy I am aware of the airflow jobs check feature (and the /health endpoint), but there are two problems that prevent the user-community chart from adopting it exclusively for our health checks:

  1. We aim to support all versions of airflow, and airflow jobs check was only added recently. I note that you also have a similar version-based if/else condition in the upstream chart.
  2. We have an optional feature called "task creation check", which is more advanced than checking the airflow-provided SchedulerJob, and checks for actual tasks being scheduled (which is needed for some older airflow versions which are subject to deadlocks in their scheduler, preventing SchedulerJob from reflecting the true state of the scheduler).

Given this, I think my initial patch to get airflow 2.6.0 working properly will be what I suggested in https://github.com/airflow-helm/charts/pull/738#issuecomment-1542848354, that is, a try/except block to detect which of SchedulerJob or SchedulerJobRunner is available.

To be clear, I don't see it as a problem if the chart needs a patch release for new versions of airflow (but I would obviously appreciate airflow not breaking/renaming APIs if not truly needed). Also, we have a support matrix for airflow version <-> chart version which helps users know when its safe to upgrade to new airflow versions.

thesuperzapper commented 1 year ago

Closing in preference of https://github.com/airflow-helm/charts/pull/743

potiuk commented 1 year ago

but I would obviously appreciate airflow not breaking/renaming APIs if not truly needed

Airflow has (as of 2.6.0) very detailed list of what is considered a public interface and what is not: https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html

And it has this very clear explanation of what is not a public API. You are now explicitly using somethig (Database Structure) that Airflow specifically explained as something that is not a public API and explained that it will be changing in the future:

What is not part of the Public Interface of Apache Airflow?

Everything not mentioned in this document should be considered as non-Public Interface.

Sometimes in other applications those components could be relied on to keep backwards compatibility, but in Airflow they are not parts of the Public Interface and might change any time:

Database structure is considered to be an internal implementation detail and you should not assume the structure is going to be maintained in a backwards-compatible way.

Web UI is continuously evolving and there are no backwards compatibility guarantees on HTML elements.

Python classes except those explicitly mentioned in this document, are considered an internal implementation detail and you should not assume they will be maintained in a backwards-compatible way.

thesuperzapper commented 1 year ago

@potiuk my main request is that changes to long-standing DB structures are treated as "soft breaking", so only happen in major or minor releases (e.g. 2.6.X -> 2.7.0), because I am not the only one using them.

For example, many users have plugins (like custom UI's and monitoring systems) that make various queries to the database.

I don't think it's a huge inconvenience for the Airflow project to limit DB structure changes to minor versions, and it also limits the blast radius when these changes cause issues in airflow itself (like https://github.com/apache/airflow/pull/30302), as people are more careful about upgrading to major/minor releases.

potiuk commented 1 year ago

I don't think it's a huge inconvenience for the Airflow project to limit DB structure changes to minor versions, and it also limits the blast radius when these changes cause issues in airflow itself (like https://github.com/apache/airflow/pull/30302), as people are more careful about upgrading to major/minor releases.

It is huge inconvenience and we all (in the community, by voting) agreed that DB structure is an implementation detail and we communicate it clearly now. Don't realy, nor use it. There is a comprehensive API that can be used instead (and it was introduced almost 3 years ago precisely for the purpose that you can use it). This was actually the purpose of the API introduced then and it was clearly stated that this is the "maintained" and stable way of accessing Airflow from outside.

There is absolutely no difference in which version - minor or patchlevel - it is going to change. We had cases where we had to change structure in patchlevel versions already (and we will likely have to do it again).

Simply Don't rely on it being unchanged. You've been warned. It's very clearly stated now, while before it has been implicit.

And it's not discussion with me to have. It should be a discussion on a devlist if you want to change the current approach.

potiuk commented 1 year ago

@potiuk my main request is that changes to long-standing DB structures are treated as "soft breaking", so only happen in major or minor releases (e.g. 2.6.X -> 2.7.0), because I am not the only one using them.

And in case you have not noticed: Yes. We know. We want everyone using them to stop doing so.