apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Release Airflow 3.0 #39593

Open kaxil opened 6 months ago

kaxil commented 6 months ago

Hello all,

Creating a meta-issue to track all the projects related to Airflow 3 and pointers on how contributors can help in this effort.

The Home Page for Airflow 3 discussions is: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3.0

Workstreams: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Workstreams

How to participate & help?

Check this doc and find items without an owner; these are the workstreams that need someone in the community to lead them. Comment and tag me if you are interested in any of the workstreams.

The following items need an owner:

... and open issues tagged with airflow3.0:candidate that have no assignees. This doc also lists other great feature ideas for Airflow 3.1 that don't need an AIP, so if someone has time to pick one up for 3.0, please add a comment.

Timeline

| Date | Milestone |
| --- | --- |
| 9 August 2024 | The main branch becomes Airflow 3 as soon as Airflow 2.10 is released |
| Dec 2024 | Dev complete on breaking changes for Upgrade Utilities work |
| Jan 2025 | Alpha releases |
| Feb 2025 | Beta releases |
| March 2025 | Airflow 3 release |
jscheffl commented 6 months ago

Why not add a "Project"? --> https://github.com/apache/airflow/projects

JossWhittle commented 5 months ago

Is there any planned follow-up to AIP-48 that would expose a Dataset API to custom providers and give a mechanism for polling for Dataset changes using deferrable triggers?

Since AIP-48's scope was shrunk so it could be merged, there has not been any visible discussion about a follow-up AIP or any progress towards the remainder of its goals on the 3.x roadmap.

kaxil commented 5 months ago

> Why not add a "Project"? --> https://github.com/apache/airflow/projects

Because we will have multiple "Projects"

kaxil commented 5 months ago

> Is there any planned follow-up to AIP-48 that would expose a Dataset API to custom providers and give a mechanism for polling for Dataset changes using deferrable triggers?
>
> Since AIP-48's scope was shrunk so it could be merged, there has not been any visible discussion about a follow-up AIP or any progress towards the remainder of its goals on the 3.x roadmap.

Not yet, but Airflow 2.9 included support for Dataset event updates, which could act as a proxy for a "push-based" mechanism until we have a poll-based mechanism.
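For concreteness, here is a minimal sketch of pushing such an event from outside Airflow, assuming the create-dataset-event REST endpoint introduced in 2.9; the exact path and payload should be checked against the stable REST API reference, and the URL, credentials, and dataset URI below are placeholders:

```python
import requests

# Placeholder webserver URL and basic-auth credentials.
AIRFLOW_API = "http://localhost:8080/api/v1"
AUTH = ("admin", "admin")

# Emit a DatasetEvent for a dataset URI, carrying the queue message in `extra`
# so a downstream, Dataset-scheduled DAG can read it.
resp = requests.post(
    f"{AIRFLOW_API}/datasets/events",
    auth=AUTH,
    json={
        "dataset_uri": "queue://orders",     # must match a Dataset some DAG schedules on
        "extra": {"message_id": "abc-123"},  # free-form payload for consumers
    },
)
resp.raise_for_status()
print(resp.json())
```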

JossWhittle commented 4 months ago

@kaxil I found the internal API call to create a DatasetEvent the other day, but I'm on the fence about whether I want to abuse it to solve my problem.

https://github.com/apache/airflow/blob/35faaf8b5542425248ecc94aaea79c68b998ab16/airflow/datasets/manager.py#L64

https://github.com/apache/airflow/blob/35faaf8b5542425248ecc94aaea79c68b998ab16/tests/api_connexion/endpoints/test_dataset_endpoint.py#L665

I want to be able to have a custom Dataset class listening to a message queue. Currently this is achieved using a separate, continuously scheduled DAG with a deferrable operator that consumes messages and triggers a DAG run of the actual processing DAG.

This could be changed to create a DatasetEvent embedding the message(s) from the queue in the extra field, and have the processing DAG schedule on that Dataset. Would this be inherently dangerous to do?

Using an external DAG for polling in either case at least gets all the fault tolerance and deferrability of a DAG, and means status and history are shown in the UI.
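As a rough sketch of the DatasetEvent-with-extra approach described above (the dataset URI and DAG are made up, and this assumes the `triggering_dataset_events` context entry available in recent 2.x releases):

```python
from __future__ import annotations

import pendulum

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical dataset URI standing in for the message queue.
queue_dataset = Dataset("queue://orders")


@dag(
    schedule=[queue_dataset],  # run whenever a DatasetEvent for queue://orders arrives
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def process_orders():
    @task
    def handle(triggering_dataset_events=None):
        # Each triggering event's `extra` can carry the message payload pushed by the poller.
        for uri, events in (triggering_dataset_events or {}).items():
            for event in events:
                print(uri, event.extra)

    handle()


process_orders()
```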

A downside I am seeing, though, is that when my DAGs finish and I write to a message queue, this is entirely decoupled from being able to declare that outgoing queue as a Dataset outlet.

In fact, outlets can't really be used at all here, because we want to trigger on the message being pulled from the queue by another polling DAG, not on the current DAG simply succeeding, which won't have passed the message(s) into the extra field.

This means the graph of inter-DAG dependencies is always broken up, which is unfortunate.

Perhaps in the meantime Dataset could get a constructor argument that prevents it from triggering a DatasetEvent when used as an outlet. This would allow outlets to be used to mark up inter-DAG dependencies.


I think in a world where there is a polling mechanism, Dataset outlets on succeeding tasks should only hint to Airflow that pollers should poll, but shouldn't create a DatasetEvent directly. Pollers should be deferrable and fault tolerant, so an outlet Dataset firing really just means waking the poller immediately if it is deferred; otherwise it will pick it up on its own.
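To make the outlet idea concrete, here is a minimal sketch of marking the outgoing queue as an outlet today; the `emit_event` flag mentioned in the comments is purely hypothetical and does not exist in Airflow:

```python
import pendulum

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

# Hypothetical URI for the outgoing message queue. The proposed (non-existent)
# constructor flag would look something like Dataset("queue://results", emit_event=False),
# letting the outlet record the inter-DAG dependency without firing a DatasetEvent.
outgoing_queue = Dataset("queue://results")

with DAG(
    dag_id="producer_example",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    PythonOperator(
        task_id="publish_to_queue",
        python_callable=lambda: None,  # placeholder for the real publish logic
        outlets=[outgoing_queue],      # today this emits a DatasetEvent when the task succeeds
    )
```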

gopidesupavan commented 4 months ago

Hi @kaxil, I would like to take part in this Airflow 3 journey and am happy to contribute here. I can take a look into the Improvements to Sensors item.

amoghrajesh commented 4 months ago

@kaxil for items such as "Airflow Standalone Improvements" and "Improve Debugging Story", I think we need more than just the heading, because these are pretty open-ended. I'd love to contribute to them, and also to "Remove deprecated code" :)

gyli commented 3 months ago

Hi @kaxil, I would love to take Consolidate "Serialization" code; could you please create an issue for this one?

gopidesupavan commented 3 months ago

Hi @kaxil

Over the past few days, I've thoroughly examined the core codebase and now have a strong grasp of the sensor component. I've also been contributing to Airflow for several months.

I would like to explore the core areas further, and this feels like a great opportunity to do so.

Could you please shed some light on what is expected for "Change sensors to use async by default"?

The Confluence page https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Workstreams says, under Sensor Improvements:

> Remove poke/reschedule mode from sensors

But here in the issue above, it only mentions removing poke mode.

Is the plan to remove both modes and always default to async?

Appreciate your help 😄

Also, I'm not sure if anyone else is already working on this; I'm happy to contribute to other areas as well 😄

raphaelauv commented 3 months ago

For Airflow 3, let's rename Dataset -> DagEvent or TriggerEvent.

kaxil commented 3 months ago

@raphaelauv -- @vincbeck is owning "Poll external Datasets to have event-based DAG scheduling", which includes adding the concept of "Events" as mentioned in https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Workstreams#Airflow3Workstreams-Othercandidates . That should ideally include DatasetUpdateEvent, TaskCompletionEvent, DagCompletionEvent, etc. I am waiting to see an AIP, but would like to see that in addition to external polling.

kaxil commented 3 months ago

> Could you please shed some light on what is expected for "Change sensors to use async by default"? [...] Is the plan to remove both modes and always default to async?

@gopidesupavan Yes, ideally we should use the best possible way to run a sensor, which would be an async implementation on the Triggerer. If the Triggerer isn't available, it can fall back to poke or reschedule on the worker, but at the very least it should default to the most efficient option without users having to specify it. Feel free to create your own proposal on what you think is the best option.
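For illustration, a minimal sketch of the "async on the Triggerer, fall back to poke on the worker" behaviour described above, using the existing `BaseSensorOperator` and `self.defer()` APIs; the sensor class, its `deferrable` flag, and the use of `TimeDeltaTrigger` as a stand-in trigger are hypothetical:

```python
from __future__ import annotations

from datetime import timedelta

from airflow.sensors.base import BaseSensorOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitForPartitionSensor(BaseSensorOperator):
    """Illustrative sensor: prefer deferring to the Triggerer, fall back to poke."""

    def __init__(self, *, deferrable: bool = True, **kwargs):
        super().__init__(**kwargs)
        self.deferrable = deferrable

    def poke(self, context) -> bool:
        # Classic worker-side check used by poke/reschedule mode.
        return self._partition_exists()

    def execute(self, context):
        if self.deferrable:
            # Hand the wait over to the Triggerer; TimeDeltaTrigger stands in for a
            # real async trigger that would watch the external system.
            self.defer(
                trigger=TimeDeltaTrigger(timedelta(minutes=5)),
                method_name="execute_complete",
            )
        # Fallback: run as a regular poke/reschedule sensor on the worker.
        return super().execute(context)

    def execute_complete(self, context, event=None):
        # Resumes on a worker once the trigger fires.
        return None

    def _partition_exists(self) -> bool:
        # Placeholder for the actual external check.
        return True
```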

gopidesupavan commented 3 months ago

> @gopidesupavan Yes, ideally we should use the best possible way to run a sensor, which would be an async implementation on the Triggerer. [...] Feel free to create your own proposal on what you think is the best option.

Thank you @kaxil. Sure, I have a couple of things in mind; I will draft my proposal and send it out for review soon.

gopidesupavan commented 3 months ago

@kaxil Wanted to give you an update on this. I've been exploring various options to run the sensors entirely in triggerer mode and found a way to do so. I put together a proof of concept, and the results look promising. Additionally, I identified some possibilities for removing the poke and reschedule processes. However, there are definitely some downsides, and I'm struggling to fully assess them. I could really use your expertise on this :) I will send out a draft this week.

gopidesupavan commented 3 months ago

Hi @kaxil, I have sent out the draft and tried my best to lay out my thoughts and the POC :). I'd appreciate your feedback and suggestions. :)

kaxil commented 3 months ago

Thanks @gopidesupavan, I will check it out this week.