airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

Official helm chart released by the Airflow Community #211

Closed: potiuk closed this issue 3 years ago

potiuk commented 3 years ago

Hello @thesuperzapper

The Airflow community has finally released the "official" helm chart. Documentation here: http://airflow.apache.org/docs/helm-chart/stable/index.html

Release message/thread here: https://lists.apache.org/thread.html/r131d839158b8a7a92a7813183cae30d248be9e330ea2faaf9e654970%40%3Cdev.airflow.apache.org%3E

We've discussed before the idea of the community providing support for the helm chart, and the community decided to continue with the helm chart donated by Astronomer. This is now fully available and released, and it will be maintained and supported by the community.

I wonder whether you would like to keep the separate chart and maintain it, or whether you would rather deprecate it and help users transition to the "community managed" official chart. Having two charts in the ecosystem might be confusing for users.

If you choose that route, we are open to helping with the transition, and possibly even to helping develop transition documentation, how to switch, etc.

Let us know what you think.

andormarkus commented 3 years ago

Hi @potiuk

For me "official" helm chart lacks several features which makes transition not feasible for me. I really like this chart because it offers turn key solution does not require addition configuration steps like the official one.

Currently, I am missing the following features in the official chart:

Adding subPath to the official chart will not be an issue. However, for adding the post-install/post-update helm hook jobs / extraPipPackages to the official chart, I don't know whether the Airflow community will approve it.

extraPipPackages is important for us because we are using the official Airflow Docker image. If we want to add an extra pip package, we need to run a custom Docker image, which we want to avoid.

kaxil commented 3 years ago

Thanks @andormarkus for your comment; we definitely won't support extraPipPackages or the jobs that create Connections, Pools, Users, or Variables.

I understand why you need it, but this is not the right approach. Changing resources like that is not correct, in my opinion.

For installing pip packages -- you should do it when building the Docker image itself, not on the fly when installing via the chart. Once built, the Docker image should be immutable. This is very important for production systems running critical workloads. Connections and Variables should be handled either via Environment Variables or another Secrets Backend --- again, for a production system you do not want to expose secrets in plain text in your values.yaml or the YAML file you use for overriding values.

For Pools (and even Connections & Vars): You should use the API: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/post_pool
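
For illustration, a rough sketch of what that could look like with curl against the stable REST API (the host, the credentials, and the assumption that an API auth backend accepting basic auth is enabled are mine, not part of the original comment):

# Create a pool via the stable REST API (assumes the webserver is reachable
# on localhost:8080 and an auth backend accepting basic auth is configured).
curl -X POST "http://localhost:8080/api/v1/pools" \
  --user "admin:admin" \
  -H "Content-Type: application/json" \
  -d '{"name": "my_pool", "slots": 128}'

# Resizing default_pool works the same way via PATCH:
curl -X PATCH "http://localhost:8080/api/v1/pools/default_pool" \
  --user "admin:admin" \
  -H "Content-Type: application/json" \
  -d '{"name": "default_pool", "slots": 10000}'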

And lastly, for user management, I would generally recommend using LDAP, OAuth, or other auth backends.

potiuk commented 3 years ago

Yeah. My opinion is very close to @kaxil's - the "extraPipPackages" and post-install hooks do not really follow the way a productionized installation should be done. In my opinion (and that's an opinion only) they introduce more problems than they solve, because you add unnecessary variability and delays where the same can easily be achieved by just building a custom image, which is easy, straightforward, and recommended. The official Airflow image is really a "reference image", and there are several ways to have your own custom image - either by customizing the build process, or by extending the image the usual way all images are extended (with the FROM: directive).

As @kaxil mentioned - for Airflow 2.0+ we now have APIs and even an official Python client that can be used more easily (and if any more features are needed, we are happy to discuss adding them).

Again - this is my opinion, not a source of truth. I think it's up to @thesuperzapper and the other maintainers to decide what to do - if they want to continue maintaining the helm chart and support all the potential issues, or incompatibilities coming from future releases of Airflow, that is fine. The Apache 2.0 license perfectly allows that, as long as it is clear it is not a community-managed project (in the past people were confused about whether the chart is a community-managed project or not). And if not, then we could help with providing transition paths. Happy to do so.

Just as a comment - one thing that we might have to take a look at is the use of the Airflow logo and the Apache name in the project (it is a PMC responsibility to take care of this), so that it is not easily confused with an Apache-managed project (https://www.apache.org/foundation/marks/#guidelines). I have no idea whether it is a problem for the ASF to have similarly named charts with the Airflow logo and the Apache name in both, in a way that they can be easily confused (I made that mistake myself, actually).

I think we might ask the ASF for legal advice on whether this is any problem for the foundation at all (I believe the ASF is very cautious about its brand and it's important that the brand is used properly) - but before we go there and ask, I'd love to hear from the maintainers what the long-term plan for the helm chart is, so that we know whether there is anything to ask about.

andormarkus commented 3 years ago

Hi @kaxil @potiuk

We are deploying everything with Terraform (AWS, Kubernetes, Helm); every helm chart is deployed with helm_release.

API: There is no Airflow Terraform provider. Using the official Python client or a bash script would require us to use null_resource in Terraform, which we want to avoid as much as possible. I don't think the community ever built a provider for Airflow.

Pools: I am only interested in resizing the default_pool to 10,000. I couldn't find an environment variable for it.
Do you see an option to expose the size of default_pool as an environment variable?

Connections: We are using SystemsManagerParameterStoreBackend, so create_connections.py is not relevant for us.

Users: We are using OAuth, so create_users.py is not relevant for us.

Variables: We are using environment variables, so create_variables.py is not relevant for us.

kaxil commented 3 years ago

@andormarkus Aah, yes, you can override the default pool slots via an environment variable; the documentation is missing on that page -- I will add it right now.

Use AIRFLOW__CORE__NON_POOLED_TASK_COUNT=10000 -- this should work

Or update configs via [core] non_pooled_task_slot_count = 10000

PR: https://github.com/apache/airflow/pull/15997


Good point regarding "REST API and Terraform resources" -- from a quick search, it looks like the REST API Terraform provider from Mastercard is relatively popular. Example usage: https://github.com/Mastercard/terraform-provider-restapi/blob/master/examples/dummy_users_with_fakeserver.tf

Or this one: https://registry.terraform.io/providers/fmontezuma/restapi/latest/docs

Will that work for your use case?

potiuk commented 3 years ago

Hi @kaxil @potiuk

We are deploying everything with Terraform (AWS, Kubernetes, Helm); every helm chart is deployed with helm_release.

API: There is no Airflow Terraform provider. Using the official Python client or a bash script would require us to use null_resource in Terraform, which we want to avoid as much as possible. I don't think the community ever built a provider for Airflow.

Actually, there is a Terraform provider that one of the PMC members of Airflow wrote, using the new REST API: https://github.com/houqp/terraform-provider-airflow

We might consider making it also part of "community managed" resources. Maybe you can check it out?

potiuk commented 3 years ago

Hey @thesuperzapper - WDYT? Do you plan to encourage some kind of transition (and can we help), or do you want to keep supporting your helm chart in the future? We are preparing a talk for the Airflow Summit about the "state of Airflow with K8S", so I think it would be great to know what we can tell Airflow users.

For me both scenarios are OK - having the community-managed helm chart and this one side by side, or helping people transition to the community one - but I think it would be great to agree on a direction.

thesuperzapper commented 3 years ago

@potiuk (also cc @gsemet)

I am happy to maintain this chart for as long as it is still used widely, which will probably be for a while yet as:

  1. many companies are using the chart (or forks) in longstanding production deployments
  2. there are some features which are not present in the "official" chart yet:
    1. a highly-structured values.yaml file
    2. a (somewhat) comprehensive set of docs
    3. lots of values validation (to detect common configuration issues we have seen over the years)
    4. support for most past versions of airflow (including 1.10.X)
    5. post-install jobs, for things like connections, variables, pools, users, etc
    6. a feature like extraPipPackages (which is great for testing, and honestly used by many in production)
    7. support for git-sync with HTTP auth (I may be wrong on this one)
  3. I expect the "official" chart will go through a few significant changes over the next few releases to make it more clean and easy to use, for example:
    1. re-structuring its values.yaml (and setting up a process to define which values should go where)
    2. improved docs
    3. culling out some of the more environment-specific features (possibly things like Elasticsearch integration)

Personally, I would love there to be only 1 chart in the future (if this is possible given the different opinions), but we are probably a wee way from that yet (see the above comments). However, once the charts become equal (in features and usability), I will happily start pushing users across, and I expect that would be a natural process anyway.

Note, I'm open to contributing to the "official" chart to help move this process along, I'm just not sure how best to start.

potiuk commented 3 years ago

Note, I'm open to contributing to the "official" chart to help move this process along, I'm just not sure how best to start.

Perfect. That's cool. I think we can close the ticket then and move to 'collaboration' mode. We are more than happy to welcome PRs to the chart - for small fixes it's just a PR to open; if it is something substantial, write a message to the devlist, open a discussion (if it might require brainstorming), or create an issue in the Airflow repository. This is the best way to start.

From a quick look at the differences you pointed out:

All of this has to go through the usual way we do it in the Airflow community - propose, discuss, get to consensus, and if consensus is not reachable and the change is significant, vote. The absolutely normal process for an Apache project. And it would be great if you actively participated in it - knowing that we, as a community, might reach different conclusions and directions after deliberate discussion.

I will also reach out to ASF trademarks to check if there are any issues with the naming of the charts - I do not think so, but I want to double-check whether there is any guidance. I will post the JIRA ticket about it here when I open it.

gsemet commented 3 years ago

Thanks for keeping me in the loop. I pretty much agree with everything that has been said. (I don't know if I still have a say, as I effectively do not maintain the project anymore.)

The end goal is to transition everybody to the official chart once it has all the required features. Sure, it is not optimal to maintain two charts; ideally, the official chart would get all the missing features from this one and the choice would be natural for everybody.

The differences between the two might be quite tricky to document; maybe we can start a new markdown document (MIGRATION.md?) in here that will evolve over time with the status of this chart and the status of the official chart. It would:

I also think we can put a little emphasis in our README to document a bit more clearly the position of this project:

But we may need help from the Airflow community for this documentation effort.

potiuk commented 3 years ago

The differences between the two might be quite tricky to document; maybe we can start a new markdown document (MIGRATION.md?) in here that will evolve over time with the status of this chart and the status of the official chart. It would:

Happy to help with that.

potiuk commented 3 years ago

Hey @thesuperzapper. I looked at the description of the ASF policies and I think (without reaching out to the trademarks team) that you should change the logo/description to include "Powered By Apache Airflow".

It is nicely described here: https://www.apache.org/foundation/marks/contact#products

USES IN SOFTWARE PRODUCTS

Since the primary goods of any Apache project are software products provided free of charge to the general public, 
we generally do not allow third parties to use Apache brands in their software product branding.

You may be eligible to use our Powered By branding style in your software product naming,
which does not require permissions as long as you follow all aspects of the policy, in particular
always using the "Apache Projectname" form.

Separately, we offer a set of Powered By Apache logos that you may use to show
support for or use of various Apache software projects.

And https://www.apache.org/foundation/press/kit/ contains information and tools on how to create a "Powered By" logo using the original project's logo. This is to avoid confusion about which project is the "Apache" one and which one is just "Powered by Apache".

Are you OK with changing this accordingly? Or should we reach out to trademarks@a.o to clarify? In the latter case I will need your e-mail (you can contact me via the email on my GitHub page in that case).

thesuperzapper commented 3 years ago

@potiuk I have replaced the Airflow logo with the "powered by airflow logo", and changed the description to the following (which should remove any possibility for confusion):

the community-maintained descendant of the stable/airflow helm chart

potiuk commented 3 years ago

Thanks a lot @thesuperzapper ! Would it also be possible to change the logo here: https://artifacthub.io/packages/helm/airflow-helm/airflow

potiuk commented 3 years ago

BTW, I think the "Powered by" logo would look nicer with just the Airflow logo :)

For example:

[image: powered_by_airflow logo]

kaxil commented 3 years ago

Thanks @thesuperzapper and @gsemet for your comments. Firstly, I appreciate all the effort in maintaining this chart for several years and helping the community -- so I just wanted to say thanks on behalf of the community and the Airflow PMC members.

Secondly, just want to add some notes to the comments regarding feature-parity:

  1. there are some features which are not present in the "official" chart yet:

    1. a (somewhat) comprehensive set of docs

The docs for the stable version of the official Helm Chart are at https://airflow.apache.org/docs/helm-chart/, with a separate reference page for values.yaml, including examples.

It will keep improving, and we would love any and all feedback on what we can include over there.

  1. lots of values validation (to detect common configuration issues we have seen over the years)

ACK, any help there would be 100% appreciated. You and the team have done a great job at that.

  1. support for most past versions of airflow (including 1.10.X)

The official Helm Chart does support the old 1.10.x versions too -- please do let me know if you find any issues while running any 1.10.x version.

  1. post-install jobs, for things like connections, variables, pools, users, etc

Pools -- maybe, that needs discussion. But for all the others we don't need post-install jobs, since they can be handled via a Secrets Backend, including Environment Variables.

  1. a feature like extraPipPackages (which is great for testing, and honestly used by many in production)

This is definitely a NO-go for the official chart. All dependency installation (system + Python) should be done in the Docker image itself, to make an immutable image for the Helm Chart to consume. It does not make sense to install system dependencies in Docker and install Python packages via the Helm Chart. For migration, users can just install those deps in the Dockerfile. So we have a clear migration path; maybe we can add docs for it.

  1. support for git-sync with HTTP auth (I may be wrong on this one)

We do support that; in fact, we allow users to set any git-sync parameters via environment variables. Please correct me if I am missing something.

Note, I'm open to contributing to the "official" chart to help move this process along, I'm just not sure how best to start.

Would love to have a Migration guide in both places -- https://airflow.apache.org/docs/helm-chart/stable/ as well as this repo itself. PRs are very welcome

andormarkus commented 3 years ago

@kaxil Does Airflow support updating the Secrets Backend from a DAG? Based on my limited experience, I cannot update the value like I can with a "local" variable.

kaxil commented 3 years ago

@kaxil Does Airflow support updating the Secrets Backend from a DAG? Based on my limited experience, I cannot update the value like I can with a "local" variable.

Yes, if using the "built-in" backends like the Metadata DB. No, if using an external Secrets Backend like HashiCorp Vault. We added the concept of a "Secrets Backend" in Airflow to allow users to manage "secrets" in an external system dedicated to just secrets, like Vault, which allows rotation of secrets and a read-only mode, so sysadmins can control modification of them.

If you just want a "variable" that you modify in the DAG, why are you setting it as an Airflow Variable? You can just use an environment variable inside your DAG to override the value.

BTW, while the secrets are rightfully read-only -- if you have permissions to edit those external systems -- you could use VaultHook (same as other Hooks) to modify the secrets: https://airflow.apache.org/docs/apache-airflow-providers-hashicorp/stable/_api/airflow/providers/hashicorp/hooks/vault/index.html#airflow.providers.hashicorp.hooks.vault.VaultHook.create_or_update_secret

MarkusTeufelberger commented 3 years ago

This is definitely a NO-go for the official chart. All dependency installation (system + Python) should be done in the Docker image itself, to make an immutable image for the Helm Chart to consume.

That's not encouraging to hear, since this means that the official chart is not useful for me at all. I do NOT want to maintain a build infrastructure + Docker registry just to do a pip install foo_dependency on top of an upstream container. There are other ways to solve this issue (init containers for example), but if "all dependencies need to be baked into the image from the get-go" is a hard requirement for the official chart, then the official chart is dead in the water for me.

Maybe you misunderstood how extra package installation works in this chart here? There's really not much difference between git sync and pip installation for example.

potiuk commented 3 years ago

That's not encouraging to hear, since this means that the official chart is not useful for me at all. I do NOT want to maintain a build infrastructure + Docker registry just to do a pip install foo_dependency on top of an upstream container. There are other ways to solve this issue (init containers for example), but if "all dependencies need to be baked into the image from the get-go" is a hard requirement for the official chart, then the official chart is dead in the water for me.

I think this is indeed up for discussion - how to answer some of these needs. I don't think the way it is done in this helm chart is good. It has its own challenges, but maybe there are other ways similar behavior can be achieved.

I agree with @kaxil that the image should be immutable, and this is specifically why I have designed the official Docker image to be both extendable and customizable (and I plan to add one or two more features, for example automated installation from requirements.txt if it is placed in docker-context-files, similar to the ONBUILD feature of Astronomer's image). This should definitely be the default way people use the images in a K8S environment, and it's the only one that is sustainable at scale. I cannot imagine building and reinstalling images on the fly when you have a distributed cluster, where each cluster would effectively rebuild the image, potentially leading to different images on different clusters. This is a NO-go for any "serious" usage of Airflow.

However, I see that for hobbyist usage, or when you just want to try things out and test the waters, an easier way of installing the packages can be provided.

And you do not even need helm-chart-specific support for that. I see it more as answering needs similar to those addressed by the Airflow Quick Start: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html.

We can easily build in a custom init script that performs the installation of a requirements.txt file if it is present in a specific location whenever the image is started; we could also allow mounting a cache of those installed dependencies automatically, to speed up restarts.

As of recently, we have a way to add custom init containers in the helm chart, and we could also easily add such an init script to the image and describe how to enable it for your helm chart if you want to do it. We could even have it as an option (clearly marked as a development/test one) in the chart itself to use such init containers, plus a clear path on how to migrate to a "serious" solution (i.e. how to easily build your own image using the very same requirements.txt, using the mechanism I planned to add to the Docker image). But I think just having an entrypoint and the possibility to mount it in the helm chart with a generic mount is good enough.

@MarkusTeufelberger -> would that work for you and @kaxil - WDYT?

kaxil commented 3 years ago

@MarkusTeufelberger git-sync runs as a sidecar and it won't affect running Airflow, whereas with an initContainer and pip install we are talking about changing/updating dependencies. That being said, it is possible for you to achieve the same with the official chart.

The official chart allows adding extraInitContainers (and extraContainers) to the Webserver, Worker, and Scheduler.

https://github.com/apache/airflow/blob/4bdf46d2284c06d82a2574831028cbddc520ca8f/chart/values.yaml#L408-L411

So you can run your pip install command over there and use a ConfigMap for requirements.txt if you'd like. You can go further and use a pip constraints file, or even a lock file like @gsemet mentioned below.

What we don't want to do is recommend doing that for production usage. I think the official chart is now highly extensible, so if you know what you are doing, by all means add as many things in those extraInitContainers as you like. But we want to steer users who don't know the best practices away from it.

Hopefully that will take care of your case. We can even add docs around that for "hobbyists" and make it clear not to use it for production -- that should hopefully take care of the case @potiuk mentioned.

gsemet commented 3 years ago

To ensure full reproducibility of the DAG, a good option would be to use a lock file (using pip-tools, for instance). Because I agree, baking the DAG image is costly for users, and relying on a pip install during container initialization does the job if all dependencies are locked one way or another.

potiuk commented 3 years ago

To ensure full reproducibility of the DAG, a good option would be to use a lock file (using pip-tools, for instance). Because I agree, baking the DAG image is costly for users, and relying on a pip install during container initialization does the job if all dependencies are locked one way or another.

That's exactly what requirements.txt is supposed to do. pip-tools is not needed for that. Actually, with recent changes in the way the resolver works and some nice features of pip, I think the need for pip-tools or poetry-style solutions is gone. They used to be helpful, but in the Airflow ecosystem we stuck with pip, and with pip 21 maturing we strongly recommend just using pip for all things around Python dependencies.

Example requirements.txt:

library==0.5.0

You can even use all kinds of ~= or > specifiers if you want.

With `pip freeze` you can even get the `requirements.txt` for all your transitive dependencies as well.

gsemet commented 3 years ago

Usually you put loose dependencies in requirements.txt because updating it is quite boring. Pip-tools does the pip freeze for you.

The best would be to use poetry/PEP 517.

I do not use pip directly, even the latest one; I don't see how this would change anything if you do not have another "lock" file.

I do not speak about things I don't know; I just advise about things I do every day in dev environments that have heavy reproducibility requirements. You do not want your install to fail when a new, buggy package you depend on is published on PyPI. Poetry works. Pipenv also, but it is substandard compared to poetry. pip-tools, to freeze a requirements.txt.in into a requirements.txt, also works. I don't know if pip can handle a lock file natively by now.

potiuk commented 3 years ago

I don't know if pip can handle a lock file natively by now.

Yes it can. We use it all the time in airflow:

Simply run pip freeze > constraints.txt - it will produce a requirements/constraints file with == pins for all dependencies (including transitive ones). The poetry .lock file is equivalent to that; however, the poetry lock is more akin to "install_requires" in setup.py than to constraints. And to this day, neither poetry nor pip-tools handle the Airflow way of managing dependencies well.

Let me explain where I am coming from - so that you can understand the context better - I am not just dismissing pip-tools or poetry. For some usages they might have a nicer interface, but they are not a good fit in the case of Airflow.

Poetry (and pip-tools) are - unfortunately - very opinionated in the way they work and do not handle the way Airflow works well, because Airflow is both a library and an application. Both poetry and pip-tools made a conscious decision to handle those two cases separately - either "open" dependencies for a library or "fixed" dependencies for an application. I used to believe this is correct (and it is in most cases) until I entered the Airflow world ~3 years ago. Back then I also wanted to use poetry, but it turned out it was not possible, and it still is not today.

Airflow (as an application) has to have a "golden" set of constraints (which allows it to be installed even when new dependency versions are released), but it should also (as a library) allow those dependencies to be upgraded when new versions are released, without upgrading Airflow itself (so that people can write DAGs using newer versions of the dependencies). Airflow, when installed with all providers, has ~500 dependencies (including transitive ones), which makes it one of the most complex projects in the Python packaging ecosystem when it comes to dependencies. That's why we generate and use constraint files to install Airflow, and of the three tools (pip / pip-tools / poetry), only pip can handle constraint files.
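
For reference, this is roughly how the constraint files are used when installing Airflow (the Airflow and Python versions below are placeholders - substitute your own):

# Install a pinned Airflow version using the community-published constraints file.
AIRFLOW_VERSION=2.1.0
PYTHON_VERSION=3.8
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

# Because constraints are not recorded as requirements, a DAG author can later
# upgrade an individual dependency without reinstalling Airflow:
pip install --upgrade requests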

Another project using constraints in a similar way to ours is OpenStack: https://github.com/openstack/requirements/blob/d0b389d840c75216ab2cc10df849cb98990b1503/tox.ini#L10

There are two open issues currently in Poetry (partially inspired by the way Airflow and OpenStack need constraints):

The original request mentioning Airflow (closed a long time before I commented on it) is here: https://github.com/python-poetry/poetry/issues/3225 - I explained there why constraints are needed.

Until constraints are supported by other tools, I can only recommend PIP, unfortunately.

potiuk commented 3 years ago

BTW, we have work in progress to move to PEP 517: https://github.com/apache/airflow/pull/16144 (while keeping pip/constraints).

gsemet commented 3 years ago

OK, if your answer is pip freeze > requirements.txt, you don't know what poetry and pipenv do and, more generally, what lock files are for. They exist precisely because pip freeze is not enough.

There is nothing special about Airflow dependencies. Basically every serious Python application needs to freeze its requirements somehow. I manage a dozen apps at work with real reproducibility requirements, and even PEP 517 is actually not enough.

Pip freeze is just the first level of doing it, and I can bet you (your airflow user) will do it wrong at one point.

pip-tools does it slightly better from the user-workflow perspective (i.e., a requirements.txt.in with loose requirements and a requirements.txt with the result of pip freeze); then you can use a PEP 517 process like Airflow does. My proposal to integrate a lock-file mechanism in Airflow was rejected back when PEP 517 wasn't released yet, and I am kind of sad to see this is still an issue in 2021, when we have official and efficient ways of defining, updating, and freezing dependencies like poetry and, more generally, PEP 517.

In short, if the Python dependencies can be managed by a lock file alongside the DAG (a frozen requirements.txt, Pipfile.lock, or poetry.lock), this would be equivalent to baking the deps into the Docker image from the reproducibility point of view. So yes, installing dependencies from a requirements.txt is a good thing to have, but if you want your users not to make big mistakes, use pip-tools or poetry.

potiuk commented 3 years ago

Can you please explain how you would have the user install Airflow from pip with a fixed set of requirements using poetry, while allowing the same user to upgrade the dependencies later?

I would really love to see that.

Please.

potiuk commented 3 years ago

Pip freeze is just the first level of doing it, and I can bet you (your airflow user) will do it wrong at one point

This is done automatically by our CI process. We try to upgrade all requirements, and only when the tests pass do we regenerate the constraints using pip freeze. This is done AFTER the tests pass, so we know that those constraints are not broken. No Airflow user ever runs it manually.

potiuk commented 3 years ago

And BTW, @gsemet - if you hold a grudge because your proposal was not accepted, sorry - this happens. But there is nothing wrong with making a new proposal now. Simply start a discussion on the devlist and explain in detail what the workflow would look like, how Airflow users would install it in the future even if someone releases a broken dependency, and how they are supposed to upgrade those dependencies when they are not broken (without reinstalling Airflow). No more, no less. I would love to take part in the discussion if you would like to elaborate.

gsemet commented 3 years ago

Managing dependencies is hard, and poetry makes it quite user-friendly. But yes, there is no easy answer.

In real life, your DAG developers will probably at some point want an additional module beyond the ones in the prebuilt Docker image, to test it or try something new, so letting them add these additional libs in a requirements.txt that is installed at boot is nice. Not ideal, but it helps.

potiuk commented 3 years ago

In real life, your DAG developers will probably at some point want an additional module beyond the ones in the prebuilt Docker image, to test it or try something new, so letting them add these additional libs in a requirements.txt that is installed at boot is nice. Not ideal, but it helps.

Yep. This is precisely what we just agreed to do (pending @kaxil's comment) :). I think this will be done even without init containers. This will probably be the easiest way, and we do not have to make any modifications to the helm chart.

We already have similar "hobbyist/try-out" features in the Dockerfile: https://airflow.apache.org/docs/docker-stack/entrypoint.html#upgrading-airflow-db - all denoted with _* variables to make it clear those are non-production settings (for example _AIRFLOW_DB_UPGRADE or _AIRFLOW_WWW_USER_CREATE).

I am thinking of adding a _PIP_ADDITIONAL_REQUIREMENTS variable (which might contain the content of a requirements.txt):

_PIP_ADDITIONAL_REQUIREMENTS=$(cat requirements.txt)

or manually crafted dependencies

_PIP_ADDITIONAL_REQUIREMENTS="package1==1.0.0 package2=2.0.0" 

If the image handles that via the entrypoint, then no helm chart change would be needed; packages will always be installed before the airflow command gets executed, and everyone would be happy. @kaxil WDYT?

@MarkusTeufelberger => would that be OK for you? The only thing you would have to do is add this to your values.yaml:

env:
  - name: _PIP_ADDITIONAL_REQUIREMENTS
    value: "...."

Would that work for you?

MarkusTeufelberger commented 3 years ago

Sure, I just want a (single) canonical way to have libraries I need installed when airflow starts up while also directly using the unmodified upstream docker image. I don't really care where and how exactly I have to define these, but it would have been a downgrade if I had to come up with my own solution using init-containers and extra volumes.

potiuk commented 3 years ago

I created the PR here to support it: https://github.com/apache/airflow/pull/16170 - it includes the documentation update.

Part of the documentation is here:

+----------------------------------+-----------------------------------------------------+--------------------------+
| ``_PIP_ADDITIONAL_REQUIREMENTS`` | If not empty, airflow containers will attempt to    |                          |
|                                  | install requirements specified in the variable.     |                          |
|                                  | example: ``lxml==4.6.3 charset-normalizer==1.4.1``. |                          |
|                                  | Available in Airflow image 2.1.1 and above.         |                          |
+----------------------------------+-----------------------------------------------------+--------------------------+
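
A quick way to try it out (a sketch only - the image tag and package pins are illustrative, and as the docs note this is not meant for production):

# Run the official image with the extra requirements installed by the entrypoint
# before the airflow command starts.
docker run --rm \
  -e _PIP_ADDITIONAL_REQUIREMENTS="lxml==4.6.3 charset-normalizer==1.4.1" \
  apache/airflow:2.1.1 airflow version
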
kaxil commented 3 years ago

Yeah, that sounds good (but it will also install the packages every time a pod is run). @MarkusTeufelberger, just to reiterate: installing pip packages via an initContainer is a BAD solution; it means your packages are needlessly installed every time the pod is restarted, and that is BAD. Convenience does not mean a GOOD solution. Example issue: https://github.com/airflow-helm/charts/issues/104

This is why installing them once in your docker file is a good solution. Anyone debugging tons of Production issues will tell you that. Please trust us on this if you can.

What I proposed in the last comment is there because you know what you are doing and want to continue doing it. We don't want users to do that unless they know what they are doing.

potiuk commented 3 years ago

Yeah. Fully agree, @kaxil, and just to reiterate, @MarkusTeufelberger - I cannot imagine this being used in production. It has a ton of problems when running at scale - the big drawback is that as it grows, it is much slower to bootstrap. One thing, for example, that you might face soon is an automated, uncontrollable upgrade of Airflow that you might not realize happened. New providers (likely to be released this week) will be 2.1+ only - which means that if you install a new provider this way, it will .... upgrade Airflow together with it if you happen to use Airflow 2.0. Those are the kinds of problems that you DEFINITELY do not want to find out about when your container has already started and is operating on a live database. That's why building your own image is a much better solution.

This change is similar to what was used in the helm chart of @gsemet and @thesuperzapper, and it is only useful for quick testing (in the doc update in the PR, I made it very clear that the production-ready solution is to build your own image).

But also it is a bit "better" than the helm solution:

1) no need for init containers
2) no problems similar to those experienced in #178 (solved by https://github.com/airflow-helm/charts/commit/b2eb0d99b280a58c356f3d427b593279f2773ce2)
3) faster bootstrap than the solution above - the solution for #178 required copying the installed dependencies and mounting the copied folder over from the init container to the main one
4) it works the same way for our quick-start Docker Compose

MarkusTeufelberger commented 3 years ago

To be honest, all I need is a MySQL client library and a helper library for the REST API of a proprietary product. I'll keep your warnings in mind if I actually need to install anything with Airflow in its dependency graph.

This is why installing them once in your docker file is a good solution.

Unfortunately, Dockerfiles are usually not deterministically built, and unlike docker-compose, Kubernetes does not have a native path from Dockerfile --> Docker image --> running pod, so it is up to every user to come up with their own build process for this scenario.

Forcing (or even recommending) every user to continuously maintain a fork of an official image is a BAD solution in my books. #104 can be solved by installing dependencies into a volume and attaching that to pods instead of using init containers, but that's already a solution suggested there. The thing I want is an unmodified upstream container that I do not have to modify on each release, with the additional packages my plugins need installed and available.

For the record, installing dependencies at pod startup time is also far from what I would see as an ideal solution to this problem ("I need libraries foo and bar that are not part of the official image available on my DAG runs"), but it at least doesn't place a heavy burden on the operator of the infrastructure. Building your own images is definitely a solution to this problem (and comes with its own issues). The cost for doing that (maintenance, infrastructure) is nontrivial though.

potiuk commented 3 years ago

Forcing (or even recommending) every user to continuously maintain a fork of an official image is a BAD solution in my books. #104 can be solved by installing dependencies into a volume and attaching that to pods instead of using init containers, but that's already a solution suggested there. The thing I want is an unmodified upstream container that I do not have to modify on each release, with the additional packages my plugins need installed and available.

Just FYI, no fork is ever needed. The official Dockerfile is super versatile precisely so that you should not have to fork it. It is specifically designed to handle two scenarios:

1) easily extending the image via a FROM directive (this one uses the unmodified upstream Airflow image as a base)
2) more complex and flexible customizing of the image by providing build args, requirement files, and custom packages

See https://airflow.apache.org/docs/docker-stack/build.html

The problem with Airflow is that you cannot have a single "binary" upstream image handle all cases - because an Airflow deployment has so many variants (more than 60 providers you can choose from), plus many people have their own custom requirements.

Case 1) is super simple: you just have your own simple Dockerfile with a FROM apache/airflow clause, and you can add your own dependencies there. True, you need to have a process to maintain your "custom" additions and store the image, but if you are getting into using Kubernetes, this is what you already committed to from day one.
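
As a minimal sketch of case 1) - the image tag, package pins, and registry name below are purely illustrative:

# A tiny Dockerfile that extends the unmodified upstream image with extra pip packages.
cat > Dockerfile <<'EOF'
FROM apache/airflow:2.1.0
RUN pip install --no-cache-dir PyMySQL==1.0.2 lxml==4.6.3
EOF

# Build and tag it, then push/load it wherever your cluster pulls images from.
docker build -t my-registry/my-airflow:2.1.0 .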

The basic premise of a K8S deployment is to use an internal registry and images. You won't escape it no matter how hard you try. Each K8S installation already has an internal image registry. It's part of the K8S architecture. And if you have a locally built image, you do NOT need an additional registry. You simply build the image locally, push it to the K8S registry, and that's it. Same as when you install Helm Chart updates. This is just another command to run. You load your locally built image into the K8S internal registry and it is immediately available for all pods. No additional registry is needed.

Maybe that's where your "non-trivial" impression comes from - you think you need something else. No, you don't. You already have everything you need the moment you have K8S installed.

For example, when you use 'kind' as your K8S cluster, it is literally two commands:

docker build -f YOUR_DOCKERFILE -t <IMAGE_NAME> <whatever args you want to add> .
kind load docker-image <IMAGE_NAME>

That's it. You are done.

Every K8S cluster deployment has similar capabilities. I am tempted to call it 'trivial', actually.

Case 2) is more complex but also more powerful. The official image with "customization" is WAY more versatile. And the whole premise of it is that you can build your very custom image without forking it at all. It supports, out of the box:

All of this is available in the image without the need to keep a fork. You just have to place your own custom files in "docker-context-files" and pass the right build arguments.

Does it look like you can do better than that? I seriously doubt it. If you would like to use K8S, I think you should simply use it as intended. Dynamically updating the image while it is running is simply an abomination - something that is unnecessary and harmful.

MarkusTeufelberger commented 3 years ago

Each K8S installation already has an internal image registry. It's part of the K8S architecture. [...] This is just another command to run. You load your locally built image into the K8S internal registry and it is immediately available for all pods. No additional registry is needed.

It seems you're mixing up a local image cache of some CRI implementations and a container registry? Otherwise, please point to the documentation of the internal registry component on https://kubernetes.io/docs/concepts/overview/components/

Every K8S cluster deployment has similar capabilities. I am tempted to name it 'trivial" actually.

Mine don't even use Docker as the CRI, and support for dockershim has already been removed from Kubernetes. Yes, I can build containers in there using other means (my personally preferred one would be kaniko), but as I said, that's a personal choice, not something that comes with Kubernetes.

you can build your very custom image without forking it at all

By "forking" I meant "taking on the responsibility of testing changes and deviating from upstream builds + release artifacts". "Custom image" kinda implies that. Option 1 just takes your version as the base, Option 2 means I need a completely custom image build process that might or might not work at all with no way of getting any support from upstream. It means that I either add a layer on top of the official image with my libraries or I completely rewrite the way the image is built using the way the official one is built as a guideline. Both will result in an image that's 100% unique to my process and that will need a rebuild whenever upstream releases either a new image or software version. There's no documentation on how to do that in place (because it concerns something outside of both helm and airflow) and everyone who wants to do this correctly will have to invent their own process to get there.

https://github.com/apache/airflow/blob/f47e10c3885a028e7c45c10c317a7dbbff9e3ab9/Dockerfile#L326 alone already makes your released image no longer reproducible easily and if someone goes for your "case 2" approach, they can run into real trouble.

Every K8S cluster deployment has similar capabilities.

I really wish that they would. It would make a lot of things easier. Now try your example with a more production-like cluster deployment such as https://www.talos-systems.com/platform/talos-os-for-kubernetes/ - good luck.

Does it look like you can do better than that? I seriously doubt it. If you would like to use K8S, I think you should simply use it as intended. Dynamically updating the image while it is running is simply an abomination - something that is unnecessary and harmful.

Passive(?)-aggressiveness aside, another alternative would be to vendor libraries with custom plugins (check whole libraries, along with the plugins that use them, into git in a state where they can be imported from that folder), so they are "installed" using the git-sync mechanism rather than upon startup of the pod. It is not something I am too fond of as a solution, but it would also mean that there is a versioned and stable instance of a dependency available that can be managed outside Airflow itself. I'm just not sure if that's a better (or intended) way to add dependencies while sticking with unmodified upstream images.

potiuk commented 3 years ago

Mine don't even use Docker as the CRI, and support for dockershim has already been removed from Kubernetes. Yes, I can build containers in there using other means (my personally preferred one would be kaniko), but as I said, that's a personal choice, not something that comes with Kubernetes.

Of course. If you watch my "Production Docker Image" talk from last year's summit - https://www.youtube.com/watch?v=wDr3Y7q2XoI - that's where I specifically talked about "containers" vs "docker" and tried to pass on the message. But I failed and went back to using Docker as the name. As unfortunate as it is, it seems that (as in other cases) Docker, and especially "Dockerfile" which does not have another name ("Containerfile"???), has remained the de-facto name for anyone using containers.

I have already stopped correcting people who use Docker when they could say containers, and corrected our documentation back because people were confused. Docker is so deeply "embedded" in the whole process that people will not understand if you say "container". I advise you to do the same (even if it's not technically correct), as it makes it easier to pass on the message.

But yeah. You can use kaniko, podman or others to build the image. That was just an example. No more, no less.

It seems you're mixing up a local image cache of some CRI implementations and a container registry?

Correct. Thanks for pointing this out. This is the internal cache I referred to; indeed, it's not a fully-fledged registry. That does not change the fact that you do not need an external registry and you can simply load the image - especially when you have K8S installed for "test" purposes. (Again - I cannot imagine anyone trying to dynamically update their image in production on a "fully-fledged" cluster.) You can contest that and dislike it, but this is simply how it works. If you chose K8S for deployment, I think you should live with the consequences of that choice (or maybe try to change it by submitting proposals to K8S, or by changing K8S to something else).

Both kind and minikube support the "load" command to make your image available for the cluster (in the cache, not registry as you rightfully corrected me).

I really wish that they would. It would make a lot of things easier. Now try your example with a more production-like cluster deployment such as https://www.talos-systems.com/platform/talos-os-for-kubernetes/ - good luck.

Here is how you do it in Talos: https://www.talos.dev/docs/v0.7/guides/configuring-pull-through-cache/#using-caching-registries-with-docker-local-cluster - it is actually even easier: rather than loading the images into the cluster, you configure Talos to pull the images from the local docker cache. So all you need to do is build the image. This can be used on air-gapped systems and will never reach out to any registry. Very handy, actually. No registry needed - which was my point.

Passive(?)-aggressiveness aside

Nope. I am just straightforward. K8S and Docker were never meant to add extra "software" layers after images are built. Trying to do that "live" is an abomination, and I am happy to repeat that. Syncing/installing packages using git-sync is a terrible idea as well. Container "layers" are the way to add new software on top of existing layers. This is how containers work; this is what they were created for. Installing software additionally inside Kubernetes, rather than adding it outside (by building containers), is a terrible idea as well. Using Git to synchronize packages is a really bad idea. If you have DAGs (which are source code), yeah - using git-sync is a great idea. But using it to synchronize binary packages to install is - again - an abomination. And I am not afraid of that word.

Both will result in an image that's 100% unique to my process and that will need a rebuild whenever upstream releases either a new image or software version. There's no documentation on how to do that in place (because it concerns something outside of both helm and airflow) and everyone who wants to do this correctly will have to invent their own process to get there.

Documentation for the process is here: https://airflow.apache.org/docs/docker-stack/build.html#customizing-the-image

It's rather comprehensive but imperfect (and we are improving it continuously to make the process simpler and more stable). This is precisely why I have waited with giving the image "official" status. AIP-26 is still not closed because of that. There are two tasks left (because I wanted to iterate with the users of the image to work out simple yet powerful ways of building the image that will also be stable in the future). See https://github.com/apache/airflow/projects/3.

The two steps left:

1) Have a separate (automatically generated) repository where users will only get the Dockerfile + needed files (not the whole of Airflow) in order to be able to customize the image

2) Submit the image to "DockerHub" to gain the "official" image status.

And I am going to add full backwards compatibility checks when this happens.

potiuk commented 3 years ago

And a comment here:

https://github.com/apache/airflow/blob/f47e10c3885a028e7c45c10c317a7dbbff9e3ab9/Dockerfile#L326 alone already makes your released image no longer reproducible easily and if someone goes for your "case 2" approach, they can run into real trouble.

Yes. Full reproducibility is a non-goal for me (and for the image). The much more important feature for me is that whenever we build the image from scratch, we get the latest security fixes with it - not a 'reproducible binary copy' of what was there before. Plus, our whole build process has automated tests that test the built images - we have a sequence of automated tests in CI with every single pull request, which run whenever we build the image and check a number of things. And yes, this is a deliberate decision.

talnicolas commented 3 years ago

Just a bit curious: since I see that the official helm chart is going full Astronomer, and we ran a POC a few months ago, I would like to check whether one really problematic thing in the deployment process has been fixed since then.

Since you build an immutable image for all the components, at each change - whether to the scheduler, webserver, or worker - all the components were being redeployed/restarted. Losing the UI was not that much of an issue, just a bad user experience in my opinion (at least the rocket looked cool), but having the scheduler restarted and not scheduling any task each time you make a change that is not even scheduler-related (just to deploy a new DAG!!) was a real aberration for me, and I don't understand how it makes sense in a production environment with time-critical workflows.

Is this still the behavior of the official helm chart today, or has a different strategy been put in place, @potiuk @kaxil?

kaxil commented 3 years ago

re: Official Helm Chart

@talnicolas The Webserver does not restart with the official Helm Chart; we have a RollingUpdate for Airflow >= 2.0.0, since we made DAG Serialization a requirement from 2.0.0.

https://github.com/apache/airflow/blob/9351e2a2d43b6ec3c019df19afb445a3dcf587ac/chart/templates/webserver/webserver-deployment.yaml#L39-L55

You can also use git-sync or a persistent volume for DAG deployment -- so DAG deployment is separate.

re: Astronomer Platform & restart

The Webserver does not restart over there anymore either for Airflow >= 2.0.0 :) It also uses a rollingUpdate, and we have more support for DAG deployment now too - Astronomer supports volume-based deployments, so scheduler or worker restarts are no longer required :) Do reach out if you want to try it again.

potiuk commented 3 years ago

but having the scheduler restarted and not scheduling any task each time you make a change that is not even scheduler-related (just to deploy a new DAG!!)

Also, to add to what @kaxil wrote and relate it to your comment: deploying new DAGs in production by running the "helm upgrade" command sounds super weird, to be honest.

The Helm Chart should be used to deploy the "application", not to deploy another version of the DAGs. DAGs are not "part" of the Airflow application. There are other methods - specifically git-sync, which @kaxil mentioned and which is also part of the official helm chart - where deploying new DAGs happens in the background without restarting pods.

cocampbe commented 3 years ago

All I can say is, "Ugh". After all this time an 'official chart' is created. Those of us who wanted to deploy to k8s sooner rather than later found this chart and have been using it in production for quite some time. I agree with many of the arguments provided. I currently use the pip install of extra packages for different providers. It works for me. It does take a while for the pods to start - maybe 23 seconds - but after that, it's running.

The argument that an image is immutable is not a strong one, IMO. Images are immutable, but a running pod/container is not. Whether something is baked into the image or not is debatable, and I think both sides can provide some really good arguments. Personally, I think users should have some flexibility, given that we are discussing OSS. The fact is, @thesuperzapper provided a solution in a space that was pretty much abandoned, and now there is a new release. I feel like it would have been better to approach @thesuperzapper many moons ago to see if he wanted to collaborate on an official release. Now there is a schism, and it will likely be difficult for some of us to migrate to the official release. Color me super excited about trying to figure out how to migrate if we choose to.

MarkusTeufelberger commented 3 years ago

https://github.com/apache/airflow/pull/16170 is the one that adds the pip install at pod start feature to the "official" chart, so that should already be fixed in the most recent release or an upcoming one (a bit hard to say, as https://github.com/apache/airflow/releases is a mess and it is hard to find anything there).

potiuk commented 3 years ago

I feel like it would have been better to approach @thesuperzapper many moons ago to see if he wanted to collaborate on an official release. Now there is a schism, and it will likely be difficult for some of us to migrate to the official release. Color me super excited about trying to figure out how to migrate if we choose to.

Diversity of solutions is a value on its own. There is NOTHING wrong with having two parallel solutions. We discussed it many moons ago, @cocampbe - you can search the Airflow devlist - and we decided to go with the Astronomer-originated solution officially supported by the Airflow community (please search the archives). And it will continue to be so, including three people from Astronomer working hard over the last few months to get it super robust, handling many cases and, most of all, fully end-to-end tested (including full integration and unit test support - which is a HARD requirement for everything we do as the Airflow Community). Just go and see how "robust" and "future-proof" it is.

And we are not at "odds" with @thesuperzapper. We actually collaborate to some extent - including @thesuperzapper hosting the Melbourne chapter of the Airflow Summit. Heck - if you read the thread from the very beginning, we even offered to help people figure out ways to migrate to the community chart and to see whether we can find some "other" ways of handling their needs where we don't agree with the way it's done in this chart.

The PR #16170 that @MarkusTeufelberger mentioned is proof of that. While we think using "pip install" on a production-grade image during pod start is plainly wrong, we recognise the "development" and "test" value of it, so we added it as a development-targeted option, with a clear warning printed to the users that it should not be used in production (which you can of course ignore, but then you will not get community support for any issues that might result from ignoring it).

It's even more than that, in fact, because we enabled this development mode not only for Helm users but also for Docker-Compose users (we implemented it in the official Airflow image, not in Helm or Docker-Compose).

And BTW, it is totally fine if you pick @thesuperzapper's Helm chart. Feel free. No one is forcing you to use the official, Apache Airflow community-managed one. You just have to be prepared that the Apache-Way-driven Airflow community will not support your problems if you have them; the community built by @thesuperzapper around the other chart will.

It also means that we will make different choices and decisions (this is one of the reasons we decided to go with our own approach: we did not agree with some design choices in the helm chart of @thesuperzapper). And there is nothing wrong with that. It's OK to have different opinions and make different decisions, as long as you discuss it and weigh the other options.

Just let me repeat it again - we are not Helm-centric, nor even K8S-centric. We are Airflow-centric. Helm, for us, is only one of the ways Airflow is deployed, and it would be a very narrow view for us to limit the choices we make to only that part of our users. That is a big part of our "product" thinking. Our approach has to be much broader than choosing one deployment option over another. I understand it "looks" like the pip install solution is better from your perspective, but this is a very, very narrow (and I could even say, in some ways, pretty 'selfish') view. We have to consider and take into account the opinions, needs, and expectations of many of our users and build a consistent product around that - sometimes making choices that make some people a bit less happy than they could otherwise be. And we run and analyse a survey among Airflow users to be better informed about that.

If you wish to learn more about that - why we are doing it, how we took into account the results of the Airflow Survey that we run every year, and what the philosophy behind it is - I heartily invite everyone watching this discussion to watch our "Airflow Loves Kubernetes" talk with @kaxil at the Airflow Summit in two weeks. Some decisions we've made will be easier to understand then.

potiuk commented 3 years ago

as https://github.com/apache/airflow/releases is a mess and it is hard to find anything there).

@MarkusTeufelberger - I think you have that impression because you are looking in the wrong place. "Releases" is simply a snapshot of whatever we have already released (including providers). This page is automatically prepared by GitHub, and with the number of releases we have (including providers) it does look overwhelming. That's why in the official Airflow documentation we publish the changelog, which is much cleaner: https://airflow.apache.org/docs/apache-airflow/stable/changelog.html - and each provider has its own separate changelog (for example, Amazon's is here: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/index.html#changelog).

However, if you want to see what's coming, you have to look at the milestones: https://github.com/apache/airflow/milestones, where you will see things neatly organised around 2.1.1 (due to be released today), 2.1.2 (soon), 2.2 (coming in a few months), and 3.0 (which is empty for now, but after the summit you will see more stuff planned there, I am sure).

cocampbe commented 3 years ago

@potiuk I have apparently poured salt on a wound and I apologize for that. It's just frustrating.

potiuk commented 3 years ago

@potiuk I have apparently poured salt on a wound and I apologize for that. It's just frustrating.

No salt :). No wound :). I just wanted to provide the full context. It's just my way of over-communicating when I see that some assumptions are missing some bigger context. Really, I am not frustrated or angry, even if you read it that way. I think if you visit the devlist and issues more often, you'll see I am just trying to be helpful (and apologies if it looks a bit 'patronising' - I know it might certainly look like that, but it's really not intended to be).