apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Difference of extras Airflow 2.0 vs. Airflow 1.10 #12744

Closed potiuk closed 3 years ago

potiuk commented 3 years ago

Description

When Airflow 2.0 is installed from PyPI, providers are not installed by default. To install them, you should add an appropriate extra. While this behavior is identical in Airflow 1.10 for those "providers" that required additional packages, there were a few "providers" that did not require any extras to function (for example http, ftp). We have "http" and "ftp" extras for them now, but maybe some of those are popular enough to be included by default?
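To make the "add an appropriate extra" step concrete, here is a minimal sketch that builds (but does not run) the pip command for installing Airflow with a chosen set of extras. The helper name `pip_install_command` is hypothetical; `apache-airflow` is the PyPI distribution name, and the extras shown are the ones mentioned above.

```python
import sys


def pip_install_command(extras):
    """Return the argv list for installing apache-airflow with the given extras.

    Hypothetical helper for illustration; it only builds the command.
    """
    requirement = "apache-airflow"
    if extras:
        # Extras go in square brackets, comma-separated: apache-airflow[ftp,http]
        requirement += "[" + ",".join(sorted(extras)) + "]"
    return [sys.executable, "-m", "pip", "install", requirement]


command = pip_install_command(["http", "ftp"])
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would perform the install; that step is left out here on purpose.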

We have to make a decision now:

Use case / motivation

We want people to get a familiar experience when installing Airflow. While we provide a familiar mechanism (extras), and people will expect slightly different configuration and installation whose differences we can describe, maybe some of those providers are so popular that we should include them by default?

Related Issues

12685 - where we discuss which of the extras should be included in the Production Image of 2.0.

Additional info

Here is the list of all "providers" that were present in 1.10 and had no additional dependencies, so basically they would work out of the box in 1.10, but they need an appropriate "extra" in 2.0.

Also here I appeal to the wisdom of the crowd: @ashb, @dimberman, @kaxil, @turbaszek, @mik-laj, @XD-DENG, @feluelle, @eladkal, @ryw, @vikramkoka, @KevinYang21 - let me know WDYT before I bring it to devlist?

mik-laj commented 3 years ago

you should add an appropriate extra.

I am concerned that this is a good idea. I think it would be worthwhile for the user to pin a specific version so that they do not accidentally install a newer version that may contain regressions.

turbaszek commented 3 years ago

I think the http should be part of core, see discussion in https://github.com/apache/airflow/pull/12252

kaxil commented 3 years ago

http (and even ftp) do seem like they should be part of core. At least for HTTP, everything it uses (internal hooks, requirements) is already part of Airflow core's requirements too.

kaxil commented 3 years ago

The following should require explicitly installing them:

"apache.pig": [], "apache.sqoop": [], "dingding": [], "discord": [], "openfaas": [], "opsgenie": [], "sqlite": [],

vikramkoka commented 3 years ago

Absolutely agree that http should be part of core. Strongly in favor of ftp as well being part of core, assuming no additional dependencies. Tempted with imap, but unsure on the dependencies.

Nothing else comes close IMHO

ryw commented 3 years ago

i like adding imap -- essentially we're saying lower-level protocols are core (ftp, http) so imap fits into that list

XD-DENG commented 3 years ago

The following should require explicitly installing them:

"apache.pig": [], "apache.sqoop": [], "dingding": [], "discord": [], "openfaas": [], "opsgenie": [], "sqlite": [],

I agree with @kaxil , other than sqlite.

Personally I think sqlite should come together with Airflow core by default, without explicit extra installation. Consider two examples:

potiuk commented 3 years ago

Looks like ["http", "ftp", "sqlite", "imap"] is the winning set. They are all rather small and they increase the size of installation by likely less than 1%.

I am concerned that this is a good idea. I think it would be worthwhile for the user to pin a specific version so that they do not accidentally install a newer version that may contain regressions.

@mik-laj -> I do not think we have to move them to the "core". I can easily make those extras "enabled" by default, as extras that are always used implicitly. This means that even a plain pip install airflow will also install those 4 providers, by default in their latest version. There will be no "constraints" for those: the user will have to explicitly upgrade them and will keep the possibility of downgrading them. I will update the FAQs to explain this behavior.
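A rough sketch of what "extras that are always used implicitly" could look like in packaging terms, assuming a setuptools-style split between `install_requires` and `extras_require`. The constant names, the helper functions, and the placeholder core requirement are hypothetical and are not taken from Airflow's actual setup.py; only the provider ids come from this thread.

```python
# Hypothetical names throughout; this is not Airflow's real setup.py.
PREINSTALLED_PROVIDERS = ["ftp", "http", "imap", "sqlite"]
OPTIONAL_PROVIDERS = [
    "apache.pig", "apache.sqoop", "dingding", "discord", "openfaas", "opsgenie",
]


def provider_package_name(provider_id):
    """Map a provider id such as 'apache.pig' to its distribution name."""
    return "apache-airflow-providers-" + provider_id.replace(".", "-")


def build_requirements(core_requirements):
    """Return (install_requires, extras_require) for the sketch above.

    Preinstalled providers are unpinned, so users can still upgrade or
    downgrade them independently of the core package.
    """
    install_requires = list(core_requirements) + [
        provider_package_name(p) for p in PREINSTALLED_PROVIDERS
    ]
    extras_require = {
        p: [provider_package_name(p)]
        for p in PREINSTALLED_PROVIDERS + OPTIONAL_PROVIDERS
    }
    return install_requires, extras_require


# "requests" stands in for the real core dependency list (placeholder).
install_requires, extras_require = build_requirements(["requests"])
print(install_requires)
```

Keeping the four providers in `extras_require` as well means an explicit `[http]` extra still works and simply resolves to a package that is already installed.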

One more comment: I also think it will be great to have a few providers installed from day zero. People might not fully realize that there are providers, and they might be surprised not to see those other integrations installed, but by seeing a few providers pre-installed, this will be much more obvious. Simply `pip freeze | grep apache-airflow` will show them what provider packages look like.
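For anyone who prefers to do the same check from Python, here is a small sketch that approximates the pip freeze filter using the standard library's `importlib.metadata` (Python 3.8+). The function names are made up for this example, and the output format only mimics `name==version` pins.

```python
from importlib import metadata


def filter_airflow_packages(pins):
    """Keep only the 'name==version' pins whose name starts with apache-airflow."""
    return sorted(p for p in pins if p.lower().startswith("apache-airflow"))


def installed_airflow_packages():
    """List installed apache-airflow* distributions in pip-freeze-like form."""
    pins = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pins.append(f"{name}=={dist.version}")
    return filter_airflow_packages(pins)


for pin in installed_airflow_packages():
    print(pin)
```

On an environment without Airflow installed this prints nothing, which is the same result the grep would give.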

If there will be no more comments shortly, I will write this proposal to the devlist.

turbaszek commented 3 years ago

I do not think we have to move them to the "core".

@potiuk doesn't that mean that we keep them in core and make them available to all users, but they still have to refactor their DAGs (due to import changes)? Should we limit the number of changes required in users' DAGs?

potiuk commented 3 years ago

@potiuk doesn't that mean that we keep them in core and make them available to all users, but they still have to refactor their DAGs (due to import changes)? Should we limit the number of changes required in users' DAGs?

I think moving them to core now is NOT a good idea, and I think most of the "core" operators were moved inside the core anyway - at least changed module names to conform to AIP-21. I do not think there is a big difference whether they moved inside the core, or whether they are moved to providers.

http_operator -> http
contrib.ftp_operator -> ftp 

etc
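To illustrate the kind of DAG change being discussed, here is a small sketch that rewrites old import lines to provider import paths. The mapping lists module paths as I understand them for 1.10 and 2.0 (please verify against your installed versions), and `rewrite_import` is a hypothetical helper, not an Airflow tool.

```python
# Old 1.10 module path -> 2.0 provider module path.
# Assumed from the 1.10 and 2.0 layouts; verify before relying on it.
OLD_TO_NEW_MODULES = {
    "airflow.operators.http_operator": "airflow.providers.http.operators.http",
    "airflow.hooks.http_hook": "airflow.providers.http.hooks.http",
    "airflow.contrib.hooks.ftp_hook": "airflow.providers.ftp.hooks.ftp",
}


def rewrite_import(line):
    """Replace a known old module path in an import line, if present."""
    for old, new in OLD_TO_NEW_MODULES.items():
        if old in line:
            return line.replace(old, new)
    return line  # unknown imports are left untouched


old_line = "from airflow.operators.http_operator import SimpleHttpOperator"
print(rewrite_import(old_line))
```

The same rename is needed whether the module ends up inside core or in a provider package, which is the point made above.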

ashb commented 3 years ago

Anyone know how pip would cope with circular dependencies? I.e. could apache-airflow depend upon apache-airflow-providers-http (which in turn depends upon apache-airflow) without giving pip a heart attack?

That way we can have "batteries included" but still keep the advantages of keeping smaller releases/easier updating of providers.

Edit: oh Jarek has a plan already. Cool

potiuk commented 3 years ago

Anyone know how pip would cope with circular dependencies? I.e. could apache-airflow depend upon apache-airflow-providers-http (which in turn depends upon apache-airflow) without giving pip a heart attack?

That way we can have "batteries included" but still keep the advantages of keeping smaller releases/easier updating of providers.

Edit: oh Jarek has a plan already. Cool

Yep. This is already happening with all providers when we specify extras, and pip is cool with that :)
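As a toy illustration of why a dependency cycle need not be a problem, here is a sketch that collects the set of packages to install while remembering what it has already visited. This is a simplified model written for this thread, not pip's actual resolver, and the dependency table is a made-up two-package example.

```python
def collect_install_set(root, requires):
    """Walk the dependency graph from root, visiting each package once.

    The 'seen' set is what makes a cycle harmless: a package that was
    already collected is skipped instead of being followed again.
    """
    seen = set()
    order = []
    stack = [root]
    while stack:
        package = stack.pop()
        if package in seen:
            continue
        seen.add(package)
        order.append(package)
        stack.extend(requires.get(package, []))
    return order


# Made-up table with a deliberate cycle between core and a provider.
requires = {
    "apache-airflow": ["apache-airflow-providers-http"],
    "apache-airflow-providers-http": ["apache-airflow"],
}
print(collect_install_set("apache-airflow", requires))
```

The walk terminates after visiting both packages once, regardless of which one it starts from.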