apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Add support for selective docs/tests (provider-related) building in CI build-docs step #12983

Closed - potiuk closed this issue 3 years ago

potiuk commented 3 years ago

Thanks to #8874 we now have the ability to build docs selectively; we should add support for it in our CI builds' selective checks.

This will immensely speed up the CI builds for most PRs, because we can build documentation only for the providers that actually changed, and core documentation only when core docs change.

UPDATE: Breeze already has all the support needed, so this one is only about the CI speedups

potiuk commented 3 years ago

cc: @mik-laj @kaxil @ashb -> a good opportunity for one of you to get a bit more understanding of the CI :). Happy to help with that.

ashb commented 3 years ago

You can already do ./breeze build-docs -- --package-filter apache-airflow -- are you thinking that we remove the need for the -- and add things to completion here?

potiuk commented 3 years ago

More like documenting it in the docs and help (I suspected it works). And for CI it's a bit more involved: we need to figure out the right flags by looking at the list of changes coming in the PR.

potiuk commented 3 years ago

 Detailed usage for command: build-docs

 breeze build-docs [-- <EXTRA_ARGS>]

       Builds Airflow documentation. The documentation is built inside a docker container - to
       maintain the same build environment for everyone. Appropriate sources are mapped from
       the host to the container so that the latest sources are used. The folders where documentation
       is generated ('docs/_build') are also mounted to the container - this way the results of
       the documentation build are available on the host.

       The possible extra args are: --docs-only, --spellcheck-only, --package, --help

potiuk commented 3 years ago

We most likely need to have a list of packages here.

potiuk commented 3 years ago

Or maybe not - I did not even know the --package flag had been added here, and I just checked that --help displays all packages. So the Breeze part is done; only the CI speedup with selective checks is needed.

mik-laj commented 3 years ago

This is a bit dangerous as one documentation package can depend on another, so if we are not building all packages there is a chance that building one package will cause problems in another package and we will not see it before merging.

I think it might be a good idea to wait a while, while we fix other minor teething problems, and then keep adding optimizations. This task will also be easier if we finish this ticket: https://github.com/apache/airflow/issues/12282

For now, our tasks are still queued most of the time, as other projects use much more resources. Optimizing these projects can be of greater benefit to all.

Is there any reason why this optimization can't wait? Apart from, of course, the passion for continuous improvement of our CI. ❤️

potiuk commented 3 years ago

I do not think it is a "we must do it now" thing :). However, doc building now takes 23 (!) minutes on CI, so especially in the case of the "no-tests-needed" builds that do not run any of the unit tests, it is definitely the one that stands out.

I think it has very little to do with #12282 to be honest, because selective checks are NOT part of Breeze. They are fully separated and have nothing to do with Breeze at all. The selective checks could be rewritten in Python as part of implementing the 'selective docs'. That might actually be a very good idea.

And I agree it is a bit dangerous, but so are all the selective tests (however, we always have the safety-net of master builds after merge that run all the tests and docs, and this should remain after this change).

We can make the logic a bit more complex if we rewrite it in Python, to add some protection against that (for example, check very quickly that there are no :doc: or other references to the provider package that has just been changed).
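
A minimal sketch of what such a quick check could look like (the helper name, the paths, and the single reference pattern are illustrative assumptions, not the actual implementation):

    import re
    from pathlib import Path

    def provider_is_referenced_elsewhere(provider_id: str,
                                         docs_dir: Path = Path("docs")) -> bool:
        # Hypothetical helper: does any .rst file OUTSIDE the provider's own
        # docs package mention the provider via a module-path reference such
        # as :class:`~airflow.providers...`? A real check would need to cover
        # :doc: and the other reference styles too.
        pattern = re.compile(rf"airflow\.providers\.{re.escape(provider_id)}\.")
        own_package = f"apache-airflow-providers-{provider_id.replace('.', '-')}"
        for rst in docs_dir.rglob("*.rst"):
            if own_package in rst.parts:
                continue  # the provider's own docs get rebuilt anyway
            if pattern.search(rst.read_text(encoding="utf-8")):
                return True
        return False

If such a check finds no outside references, only the changed provider's docs need rebuilding; if it does find some, the safe fallback is a full docs build.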

Just saying - it's a great opportunity to both get some real benefits (in terms of feedback time and CI job strain) and do it in a 'future-proof' way (i.e. Python).

WDYT?

potiuk commented 3 years ago

> For now, our tasks are still queued most of the time, as other projects use much more resources. Optimizing these projects can be of greater benefit to all.

BTW. Unlike speeding up our builds, we have very limited impact there. We are already involved in the build meetings with INFRA and we have started gathering stats of other projects together. If you have ideas on how we can do it differently - feel free to propose them (possibly on the builds@ discussion list).

I think we should focus on what we CAN do.

kaxil commented 3 years ago

I agree with @mik-laj 's comment in https://github.com/apache/airflow/issues/12983#issuecomment-742515225

As I said in https://github.com/apache/airflow/issues/13716#issuecomment-761702451: let's not try to optimize this, because we use cross-references to apache-airflow in some providers and vice-versa, so we should build all the docs.

potiuk commented 3 years ago

All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json .

I think we should then be very consistent in what we propose as optimizations, and base it on actual gains achieved for everyone (not only for committers who push their changes). I want to think 'globally' in terms of a regular contributor adding their code, not only in terms of "let's optimise the speed of the changes that I usually do".

I returned to the subject after @mik-laj raised this: #13706

Now I want us to consider what makes more sense:

1) Optimizing #13706 - doc/*.rst documentation-only change

2) Optimizing docs build per-provider - only one provider code+documentation changes

We are optimizing not only for "now" (i.e. the 2.0.1 release and 2.0.0 'cleanups') but also for the next (say) 5 months of Airflow builds.

Whatever we invest in now will return that investment in the coming months.

Let's compare how often those builds happen (1) vs (2).

For (1), those are builds that are usually of the "cleanup and refactor" kind - we have them a bit more often now because we are cleaning up stuff: restructuring the docs a little, fixing the last spell-checking issues. However, in "regular" work I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when a code change happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.

On the other hand, I think we will have a lot (I think the vast majority) of "one-provider-only" changes coming from contributors. Those are already quite optimized during the tests, and we can (I plan to) also easily add 'run tests only for this provider and dependent ones' (using dependencies.json), as sketched below.
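
A sketch of that selection logic - assuming (the schema here is my assumption) that dependencies.json maps each provider id to the list of providers it depends on:

    import json
    from pathlib import Path

    def providers_to_test(changed: set) -> set:
        # Changed providers, plus every provider that depends on one of them.
        # Only one level is shown; a real version would take the transitive
        # closure of the dependency relation.
        deps = json.loads(
            Path("airflow/providers/dependencies.json").read_text())
        selected = set(changed)
        for provider, depends_on in deps.items():
            if any(d in changed for d in depends_on):
                selected.add(provider)
        return selected

    # e.g. for a PR touching only the "mysql" provider:
    # providers_to_test({"mysql"})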

This will cut test time down to ~5 minutes for typical providers. Now, a full docs build takes 25 minutes. If we also limit it to only those providers that changed, the docs build will also take 5 minutes for most providers. We do not have to build "airflow" docs in this case - there should be "0" references from airflow to particular providers' docs.

This basically means that a contributor will get feedback on a 'one-provider' change 20 minutes faster. Multiply that by the number of "one-provider" changes.
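
As a rough back-of-the-envelope illustration (the weekly PR count is an assumed number, not measured data):

    time saved per one-provider PR  ~= 25 min - 5 min = 20 min
    one-provider PRs per week       ~= 30 (assumption)
    CI time saved per week          ~= 30 x 20 min = 600 min = 10 hours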

I have no time to pull the 'hard' data - i.e. how many changes we might have in "regular" 2.* development that are (1) "doc/*.rst"-only vs. (2) "single-provider"-only. But my intuition tells me that over the next months the vast majority of changes will be of the (2) kind.

I firmly believe we should optimize the things that have higher impact and bring more benefits. I hate doing micro-optimisations; I always think "long-term"/"high-impact" when I am doing it.

WDYT @kaxil @mik-laj - which of those is worth it, (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as a whole. If you think we should do (1) rather than (2), I'd love to hear your line of thought and reasoning why you think it is better.

kaxil commented 3 years ago

> All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json .

This is not the kind of cross-reference we (at least I) are talking about. Google providers, for example, reference the xcom documentation from the apache-airflow (core) docs. Example: https://github.com/apache/airflow/blob/master/docs/apache-airflow-providers-google/operators/marketing_platform/campaign_manager.rst

And similarly, the apache-airflow (core) docs reference ~airflow.providers.http.operators.http.SimpleHttpOperator, ~airflow.providers.sqlite.operators.sqlite.SqliteOperator, and ~airflow.providers.jdbc.hooks.jdbc.JdbcHook.

So, at the very least, we need to build more than just the changed provider's docs:

> Now I want us to consider what makes more sense:
>
>   1. Optimizing #13706 - doc/*.rst documentation-only change
>   2. Optimizing docs build per-provider - only one provider code+documentation changes

(2) is not possible because of the reason I explained above. We need (3): optimizing the docs build to core + apache-airflow-providers + per-provider - i.e. when only one provider's code+documentation changes, build that provider's docs plus the core docs and the apache-airflow-providers docs.

> I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when a code change happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.

Wrong - we encourage doc-only changes from new contributors, who can add a missing section, fix formatting issues, or correct outdated information; that is definitely not a bad smell. Documentation has to be a first-class citizen, important for us as a project - more so because we are an OSS project. The history of doc-only changes (based on the number of commits) is also large. And documentation will always be evolving (getting better) - not only now, because we released a major version.

> We do not have to build "airflow" docs in this case - there should be "0" references from airflow to particular providers' docs.

I don't think so - it is completely fine, for example, to reference a Slack provider to explain sla_miss_callback functionality, or a Secrets Backend.

Example references:

best-practices.rst:207:Similarly, if you have a task that starts a microservice in Kubernetes or Mesos, you should check if the service has started or not using :class:`airflow.providers.http.sensors.http.HttpSensor`.
concepts.rst:434:by pre-installing some :doc:`apache-airflow-providers:index` packages (they are always available no
concepts.rst:437:- :class:`~airflow.providers.http.operators.http.SimpleHttpOperator` - sends an HTTP request
concepts.rst:438:- :class:`~airflow.providers.sqlite.operators.sqlite.SqliteOperator` - SQLite DB operator
concepts.rst:443:additional packages manually (for example ``apache-airflow-providers-mysql`` package).
concepts.rst:447:- :class:`~airflow.providers.mysql.operators.mysql.MySqlOperator`
concepts.rst:448:- :class:`~airflow.providers.postgres.operators.postgres.PostgresOperator`
concepts.rst:449:- :class:`~airflow.providers.microsoft.mssql.operators.mssql.MsSqlOperator`
concepts.rst:450:- :class:`~airflow.providers.oracle.operators.oracle.OracleOperator`
concepts.rst:451:- :class:`~airflow.providers.jdbc.operators.jdbc.JdbcOperator`
concepts.rst:452:- :class:`~airflow.providers.docker.operators.docker.DockerOperator`
concepts.rst:453:- :class:`~airflow.providers.apache.hive.operators.hive.HiveOperator`
concepts.rst:454:- :class:`~airflow.providers.amazon.aws.operators.s3_file_transform.S3FileTransformOperator`
concepts.rst:455:- :class:`~airflow.providers.mysql.transfers.presto_to_mysql.PrestoToMySqlOperator`,
concepts.rst:456:- :class:`~airflow.providers.slack.operators.slack.SlackAPIOperator`
concepts.rst:459:at :doc:`apache-airflow-providers:index`.
concepts.rst:795:``conn_id`` for the :class:`~airflow.providers.postgres.hooks.postgres.PostgresHook` is
howto/connection.rst:332:can also add a custom provider that adds custom connection types. See :doc:`apache-airflow-providers:index`
howto/connection.rst:339::py:class:`~airflow.providers.jdbc.hooks.jdbc.JdbcHook`.
howto/connection.rst:350:You can read more about details how to add custom provider packages in the :doc:`apache-airflow-providers:index`
howto/custom-operator.rst:101:See :doc:`connection` for how to create and manage connections and :doc:`apache-airflow-providers:index` for
howto/custom-operator.rst:268:is :class:`airflow.providers.google.cloud.sensors.gcs.GCSUploadSessionCompleteSensor`.
howto/define_extra_link.rst:66:all the operators through an airflow plugin or through airflow providers. You can learn more about it in the
howto/define_extra_link.rst:67::ref:`plugin example <plugin-example>` and in :doc:`apache-airflow-providers:index`.
howto/define_extra_link.rst:77:tasks using :class:`~airflow.providers.amazon.aws.transfers.gcs_to_s3.GCSToS3Operator` operator.
howto/define_extra_link.rst:86:  from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
howto/define_extra_link.rst:112::class:`~airflow.providers.google.cloud.operators.bigquery.BigQueryExecuteQueryOperator` includes a link to the Google Cloud
howto/define_extra_link.rst:119:    from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator
howto/define_extra_link.rst:145:As explained in :doc:`apache-airflow-providers:index`, when you create your own Airflow Provider, you can
howto/define_extra_link.rst:150:by ``apache-airflow-providers-google`` provider currently:
howto/define_extra_link.rst:155:      - airflow.providers.google.cloud.operators.bigquery.BigQueryConsoleLink
howto/define_extra_link.rst:156:      - airflow.providers.google.cloud.operators.bigquery.BigQueryConsoleIndexableLink
howto/define_extra_link.rst:157:      - airflow.providers.google.cloud.operators.mlengine.AIPlatformConsoleLink
howto/email-config.rst:76:      email_backend = airflow.providers.sendgrid.utils.emailer.send_email
installation.rst:88:has a corresponding ``apache-airflow-providers-amazon`` providers package to be installed. When you install
installation.rst:106:see: :doc:`apache-airflow-providers:index`
installation.rst:108:For the list of the provider packages and what they enable, see: :doc:`apache-airflow-providers:packages-ref`.
modules_management.rst:272:    apache-airflow-providers-amazon           | 1.0.0b2
modules_management.rst:273:    apache-airflow-providers-apache-cassandra | 1.0.0b2
modules_management.rst:274:    apache-airflow-providers-apache-druid     | 1.0.0b2
modules_management.rst:275:    apache-airflow-providers-apache-hdfs      | 1.0.0b2
modules_management.rst:276:    apache-airflow-providers-apache-hive      | 1.0.0b2
operators-and-hooks-ref.rst:23::doc:`apache-airflow-providers:operators-and-hooks-ref/index`.
plugins.rst:169:    from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
production-deployment.rst:762:Some operators, such as :class:`airflow.providers.google.cloud.operators.kubernetes_engine.GKEStartPodOperator`,
production-deployment.rst:763::class:`airflow.providers.google.cloud.operators.dataflow.DataflowStartSqlJobOperator`, require
production-deployment.rst:869:If you want to establish an SSH connection to the Compute Engine instance, you must have the network address of this instance and credentials to access it. To simplify this task, you can use :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` instead of :class:`~airflow.providers.ssh.hooks.ssh.SSHHook`
production-deployment.rst:871:The :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` support authorization with Google OS Login service. It is an extremely robust way to manage Linux access properly as it stores short-lived ssh keys in the metadata service, offers PAM modules for access and sudo privilege checking and offers nsswitch user lookup into the metadata service as well.
security/secrets/secrets-backend/index.rst:69:    airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
upgrading-to-2.rst:117:    from airflow.providers.docker.operators.docker import DockerOperator
upgrading-to-2.rst:126:automatically installs the ``apache-airflow-providers-docker`` package.
upgrading-to-2.rst:131:You can read more about providers at :doc:`apache-airflow-providers:index`.
upgrading-to-2.rst:507:  * You can read more about providers at :doc:`apache-airflow-providers:index`.

> I firmly believe we should optimize the things that have higher impact and bring more benefits. I hate doing micro-optimisations; I always think "long-term"/"high-impact" when I am doing it.

We should not fall into the trap of over-optimisation though. We can still get a good reduction in time (that will result in more benefits and hopefully higher impact), as we won't be building docs for all providers (only apache-airflow, apache-airflow-providers, and apache-airflow-providers-PROVIDER_THAT_CHANGED).

> WDYT @kaxil @mik-laj - which of those is worth it, (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as a whole. If you think we should do (1) rather than (2), I'd love to hear your line of thought and reasoning why you think it is better.

As I said above, I would vouch for (1) and (3), not (2) - or just (3), which would supersede (1) too.

potiuk commented 3 years ago

Fine for me. I'd say then we need (4): build only the docs packages implied by both .py and .rst cross-references.

Let's build the dependencies including documentation and generate the minimum set of "packages" that makes sense. Looking at your example, this should be as easy as finding all ~airflow.providers references, and we can build such a graph. This means that if, for example, someone commits a change to the "singularity" provider, which is not referenced anywhere else, we should build docs (and tests) for that provider only.

Does it make sense? I think this will be much easier to reason about than hardcoding core + apache-airflow-providers.
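
A minimal sketch of building such a graph from the docs tree (the single regex and the docs/ layout are deliberate simplifications, not the actual implementation):

    import re
    from collections import defaultdict
    from pathlib import Path

    # NOTE: providers such as "apache.hive" span two dotted segments and
    # would need special-casing; this sketch keeps only the first segment.
    PROVIDER_REF = re.compile(r"airflow\.providers\.([a-z0-9_]+)")

    def build_reference_graph(docs_dir: Path = Path("docs")) -> dict:
        # Map each docs package (top-level folder under docs/) to the set
        # of provider ids that its .rst files mention.
        graph = defaultdict(set)
        for rst in docs_dir.rglob("*.rst"):
            package = rst.relative_to(docs_dir).parts[0]
            for match in PROVIDER_REF.finditer(rst.read_text(encoding="utf-8")):
                graph[package].add(match.group(1))
        return dict(graph)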

potiuk commented 3 years ago

And I argue it's only the "target" of a reference that matters. I.e. if a provider refers to a core class but "core" does not change at all (and core does not refer to this provider), it means that only the provider's docs/tests should be built.
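
In code, that rule could look like the following (building on the hypothetical graph sketched above; the docs-package naming is an assumption):

    def packages_to_rebuild(changed_providers: set, graph: dict) -> set:
        # Rebuild each changed provider's own docs package, plus any package
        # whose docs reference a changed provider. A reference FROM a changed
        # provider TO an unchanged one does not force rebuilding the target.
        rebuild = {f"apache-airflow-providers-{p.replace('.', '-')}"
                   for p in changed_providers}
        for package, referenced in graph.items():
            if referenced & set(changed_providers):
                rebuild.add(package)
        return rebuild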

potiuk commented 3 years ago

And I'd say both optimisations (removing other jobs, and selective building) are equally important to me - only when we combine the two will we get real optimization, similar to what we have now when someone changes README.rst or any other documentation outside the 'docs' folder.

mik-laj commented 3 years ago

> I'd argue changing only documentation is a bad smell.

There are several types of documentation:

(source: https://www.polidea.com/blog/how-open-source-guide-new-contributors-maintainers/#document-the-code)

Only in the case of reference documentation is it a "bad smell"; other types of documentation very often arise as independent contributions, because they take a different kind of skill.

mik-laj commented 3 years ago

> I firmly believe we should optimize the things that have higher impact and bring more benefits. I hate doing micro-optimisations; I always think "long-term"/"high-impact" when I am doing it.

Any optimization that is not just a syntax change has a cost. Sometimes it is just a matter of dedicating time to implementation and reviews. However, optimizations are very often associated with higher costs. In this case, it seems to me that we have several factors that we must take into account:

We can modify each of these factors independently, but each influences the others. If we try to optimize feedback time very aggressively, it may affect reliability. We will have to analyze and solve problems related to documentation, instead of focusing on the contribution itself. We don't have many experts who know the documentation process, so any such problem will be quite a pain to investigate, both for us and for new contributors, who will not be aware that they have to overcome certain limitations of our optimization.

When it comes to feedback, we have a solution that is not overly burdensome for the contributor and is much faster than our CI: building documentation locally, where you can check the documentation in less than 1 minute for most packages. Encouraging building documentation locally will also improve the user experience, as the user will be able to quickly preview their change - but we need to encourage users to do so first.

> Looking at your example, this should be as easy as finding all ~airflow.providers references, and we can build such a graph.

This is just one example, but it does not cover all possible ways of referencing other packages. For example, a user can also reference using :ref:`howto/operator:BigtableCreateInstanceOperator` or :doc:`apache-airflow-providers-google:operators/cloud/bigtable.rst`. The number of combinations is almost unlimited, and these references can appear both in .rst files and in docstrings. You can only be sure once all these references are resolved - but that is done by Sphinx.
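
A few of those styles side by side (an intentionally incomplete, assumed list):

    import re

    # Each Sphinx reference style needs its own parsing rule - and this
    # list is far from exhaustive, which is exactly the problem.
    REFERENCE_PATTERNS = [
        re.compile(r":class:`~?airflow\.providers\.[\w.]+`"),
        re.compile(r":py:class:`~?airflow\.providers\.[\w.]+`"),
        re.compile(r":doc:`apache-airflow-providers[\w-]*:[\w/.-]+`"),
        re.compile(r":ref:`howto/operator:\w+`"),
    ]

    text = "See :ref:`howto/operator:BigtableCreateInstanceOperator`."
    assert any(p.search(text) for p in REFERENCE_PATTERNS)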

potiuk commented 3 years ago

What do you mean by "unlimited"?

Surely we are using a limited number of those, and we can enumerate them.

And any time we find we missed one and it caused a failure, we can fix it (and incrementally get better at catching such cases). We will not get it perfect the first time, but that's fine. We will catch it and fix it, and in the meantime the vast majority of builds will benefit.

Again: multiply the savings by the number of times those builds are run - this is the gain. Implementation is a one-time cost.

mik-laj commented 3 years ago

> What do you mean by "unlimited"?

Here is a list of examples; some are incorrect but most should work: https://gist.github.com/mik-laj/9b5417532ee682a6f871ee90ae5869c0

> And any time we find we missed one and it caused a failure, we can fix it (and incrementally get better at catching such cases). We will not get it perfect the first time, but that's fine. We will catch it and fix it, and in the meantime the vast majority of builds will benefit.

This is what I want to avoid. I would like to focus more on contributing to the project or to project documentation, rather than contributing to the contributing tools, unless absolutely necessary. A very small group of people wants to focus on contributing to the tools for contribution.

potiuk commented 3 years ago

> This is what I want to avoid. I would like to focus more on contributing to the project or to project documentation, rather than contributing to the contributing tools, unless absolutely necessary. A very small group of people wants to focus on contributing to the tools for contribution.

The only way to get better is to continuously improve your tools, especially when you can improve the lives of many people by making those improvements. I think this is extremely important.

mik-laj commented 3 years ago

> improve your tools

"Improve" is a very subjective word. For some, it is important that the tools are stable, reliable, and as simple as possible. For others, it is more important that they are fast, efficient, and feature-full, but this can create difficulties with stability and reliability and add more complexity. I am in favor of introducing these optimizations if we are confident in their stability and reliability, and they are also easy to maintain.

potiuk commented 3 years ago

> "Improve" is a very subjective word. For some, it is important that the tools are stable, reliable, and as simple as possible. For others, it is more important that they are fast, efficient, and feature-full, but this can create difficulties with stability and reliability and add more complexity. I am in favor of introducing these optimizations if we are confident in their stability and reliability, and they are also easy to maintain.

Yep. I usually prefer to make life easier for others, even if it is a bit more complex for me.

kaxil commented 3 years ago

> This is what I want to avoid. I would like to focus more on contributing to the project or to project documentation, rather than contributing to the contributing tools, unless absolutely necessary. A very small group of people wants to focus on contributing to the tools for contribution.

Strongly agree with this statement from @mik-laj .

> The only way to get better is to continuously improve your tools, especially when you can improve the lives of many people by making those improvements. I think this is extremely important.

Agree with that in principle, but the most important property of a contributing tool is "stability". Optimisation is not worth it if it comes at the cost of stability and makes the system harder to comprehend (our CI system is already getting difficult to grasp - at least all of it at once).

> And any time we find we missed one and it caused a failure, we can fix it (and incrementally get better at catching such cases). We will not get it perfect the first time, but that's fine. We will catch it and fix it, and in the meantime the vast majority of builds will benefit.

That causes significant issues with UX, because master starts failing and we need to tell everyone to rebase onto master. I can understand it when it is absolutely needed, or for a big optimisation; but for the current docs optimisation - while the (3) or (4) we mentioned in the comments would be good to have - it is not worth it if it comes with that cost.

> "Improve" is a very subjective word. For some, it is important that the tools are stable, reliable, and as simple as possible. For others, it is more important that they are fast, efficient, and feature-full, but this can create difficulties with stability and reliability and add more complexity. I am in favor of introducing these optimizations if we are confident in their stability and reliability, and they are also easy to maintain.

Agree with this 100% too - I would strongly prefer stability over speed. Like Kamil said: only if it is necessary and does not compromise stability; otherwise that time can be used to improve the product itself. The CI system / scripts should not become a product of their own. They are there to complement the product and make it more stable - which, with Breeze, I think we already have.

potiuk commented 3 years ago

Surely this is all subjective. And it's soooo easy to forget about things like the werkzeug drama (and a number of other similar problems). Or about all the issues flooding us when, one day, Airflow suddenly could not be installed because dependencies broke. Or about mypy/flake8 errors that a number of people could not reproduce, or about failing Kubernetes tests because the scripts combining tox, the docker image, and the K8S setup were failing randomly (which was far more complex than the current tests TBH, and no one understood it then anyway). Or about conflicting constraints and totally outdated dependencies and base Python images we had for the docker image (which are getting improved).

Most of the complexity of the system we have now is the result of fixing those problems (in a way that prevents them from reoccurring - that's why all those things constantly run on our CI, double-checking). The complexity emerges from the complexity of the system itself; it is not there without reason.

It is often the case that you fail to see the things that "just work" and forget how much work was put into getting to that "just works" state. It's really, really easy to be overwhelmed by a number of issues like that. If you do not continuously work on fighting such issues (and yes, sometimes that means increasing complexity), you might one day wake up in a state where you have just enough capacity to fight the ongoing issues.

It's super easy to underestimate its importance and turn a blind eye to it. I've seen many CI systems that took that slippery road - the problem is that once you let it happen, it goes downhill at accelerating speed. A system of this size, with this many people contributing, will inevitably take quite some maintenance time.

I wonder if you've thought about the number of changes (when CI works reasonably) that we are able to handle because of all the systems we now have in place, rather than "just code". If you look at the chart below, we have hardly any more 'active committers' working on Airflow than 1.5 years ago, yet we are able to handle many more changes:

[Screenshot: chart comparing the number of active committers with the number of changes handled over time]

This is the compound effect of many things - including the fact that most of us can focus on contributing to the product, because there are people who constantly keep the machinery not only going but also improving, making the wheels roll better (even if it is more complex).

kaxil commented 3 years ago

Again, 100% agree with that @potiuk - thanks to you leading that effort we are in much better shape.

But the issue at hand, the documentation optimisation, is what we - or at least I - were talking about. I would definitely love not to break master for it. As I said, I am happy with (3), but we need to be careful about optimizing where it is not strictly needed, as I mentioned in the previous comment.

potiuk commented 3 years ago

We can close this one. Introducing parallel building of the docs solved this problem in an entirely different way.