apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Add support for creating backfills to the stable REST API #18816

Open SamWheating opened 3 years ago

SamWheating commented 3 years ago

Description

I'd like to be able to trigger backfills remotely through a REST API endpoint - there's already support for triggering a single DAGRun so why not multiple?

This would involve creating a new endpoint (maybe under /dags/<dag_id>/backfills) which would handle the same parameters as the airflow dags backfill CLI command.
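As a rough illustration of what such an endpoint could accept, here is a minimal sketch that maps a few of the existing `airflow dags backfill` flags onto a JSON body. The path `/dags/<dag_id>/backfills`, the `BackfillRequest` class, and every field name are hypothetical; nothing like this exists in the stable REST API as of this issue.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class BackfillRequest:
    """Hypothetical request body; fields mirror CLI flags, not a real API schema."""
    start_date: str                    # --start-date, as YYYY-MM-DD
    end_date: str                      # --end-date, as YYYY-MM-DD
    task_regex: Optional[str] = None   # --task-regex
    reset_dagruns: bool = False        # --reset-dagruns
    rerun_failed_tasks: bool = False   # --rerun-failed-tasks
    dry_run: bool = False              # --dry-run


def build_backfill_call(dag_id: str, req: BackfillRequest) -> Tuple[str, str]:
    """Return the (hypothetical) endpoint path and JSON body for a backfill."""
    # YYYY-MM-DD strings sort in date order, so a plain comparison is enough here.
    if req.start_date > req.end_date:
        raise ValueError("start_date must not be after end_date")
    # Drop unset optional fields so the body only carries what the caller chose.
    body = {k: v for k, v in asdict(req).items() if v is not None}
    return f"/dags/{dag_id}/backfills", json.dumps(body, sort_keys=True)


path, body = build_backfill_call(
    "example_dag",
    BackfillRequest("2021-01-01", "2021-01-07", reset_dagruns=True),
)
print("POST", path)
print(body)
```

Keeping the field names close to the CLI flags would let a custom CLI pass its arguments through with little translation, though the final schema would be up to the maintainers.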

A coworker and I would be happy to build out this functionality, but I'd like to collect input from the community and core maintainers before we start.

Use case/motivation

We use a custom CLI built on top of the Airflow REST API for doing some basic Airflow operations from the command line - it would be awesome if we could include support for triggering backfills remotely.

Related issues

No response


potiuk commented 3 years ago

Actually, improving backfill support has already been discussed quite heavily, but it is not an easy one - mainly because backfills need to be managed, see https://github.com/apache/airflow/discussions/18428 -> this is much more than a REST API call, it needs a separate component (or a feature of an existing one) to manage and keep track of backfill requests IMHO.

I honestly think it is badly needed, and it needs someone to take a lead with it, understand the complexity and the cases, and implement and test it. Maybe one of those who want it could take a lead here?

SamWheating commented 3 years ago

Oh cool, I hadn't seen that discussion. Thanks for sharing!

It sounds like the backfill redesign is a much larger feature which will likely require its own AIP in order to gather requirements and feedback from the community, correct? I will think about some requirements and evaluate how much capacity I have to get involved with a larger change.

In the short term, I think it wouldn't be too difficult to expose the existing CLI-based backfill functionality in the REST API, but I acknowledge that this comes with some downsides:

What do you think, is it worth proceeding with a near-term fix or should any changes to the backfill functionality be grouped together as part of a larger AIP?

potiuk commented 3 years ago

I think without the larger redesign, the backfill API is not very useful - and I'd even argue the current API already has everything (or most of) what you need to be able to do the backfill (but here I might be mistaken).

By backfill I understand cleaning and re-running a series of historical dag runs - possibly for only a subset of tasks: certain tasks and all tasks that depend on them.

My view on it is that you can do it in two ways (but this would need to be brought to the devlist if we would like to move it forward either way - as this is only my opinion and I might be mistaken, maybe there are other, simpler ways):

1) "active" - basically replicating the way the current airflow backfill does it. You have a "user controlled" entity that monitors and controls the backfill. In airflow backfill it is a process started in the terminal that loops through all the historical dag runs, cleans and re-runs them. This requires an uninterrupted connection to the Airflow DB from the terminal, monitoring and reporting the status of the jobs, and active "scheduling" of tasks as if you ran them manually. I'd argue you can do it today with the current API or with small additions to it (to be verified). The only missing piece is "another client" that does it instead of the "airflow backfill" process, using the API to do what airflow backfill does by direct DB access and by running pieces of Airflow scheduling/dagrun code in the process. That is doable, it does not change the "model" of backfill, and it allows using the API rather than requiring the airflow backfill process to run somewhere the Airflow DB is directly accessible. This might be doable without a major design/AIP/changing the scheduler behaviour etc., I think.
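A minimal sketch of what such an "active" client loop might look like, assuming a daily schedule. The paths `/dags/{dag_id}/dagRuns` and `/dags/{dag_id}/clearTaskInstances` are from the stable REST API; the `post` transport, the `api_backfill__` run-id prefix, and the "trigger first, clear on 409 conflict" strategy are assumptions made for illustration, not how `airflow dags backfill` works internally. It also leaves out the monitoring half (polling run state), which a real client would need.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Callable, Dict, Iterator, List, Tuple

# `post(path, body)` is a caller-supplied (hypothetical) function that sends an
# authenticated POST under /api/v1 and returns the HTTP status code.
Post = Callable[[str, Dict], int]


def logical_dates(start: date, end: date) -> Iterator[str]:
    """Yield one ISO 8601 UTC timestamp per day, inclusive of both ends."""
    day = start
    while day <= end:
        yield datetime(day.year, day.month, day.day, tzinfo=timezone.utc).isoformat()
        day += timedelta(days=1)


def backfill_via_api(
    post: Post, dag_id: str, start: date, end: date
) -> Tuple[List[str], List[str]]:
    """Create missing runs and clear existing ones; return (created, cleared)."""
    created: List[str] = []
    cleared: List[str] = []
    for ts in logical_dates(start, end):
        # Field is `logical_date` from 2.2 on; earlier versions use `execution_date`.
        status = post(
            f"/dags/{dag_id}/dagRuns",
            {"dag_run_id": f"api_backfill__{ts}", "logical_date": ts},
        )
        if status == 200:
            created.append(ts)
        elif status == 409:
            # A run already exists for this date: clear its task instances instead.
            clear_status = post(
                f"/dags/{dag_id}/clearTaskInstances",
                {"dry_run": False, "only_failed": False, "reset_dag_runs": True,
                 "start_date": ts, "end_date": ts},
            )
            if clear_status != 200:
                raise RuntimeError(f"clearing {ts} failed with HTTP {clear_status}")
            cleared.append(ts)
        else:
            raise RuntimeError(f"triggering {ts} failed with HTTP {status}")
    return created, cleared


# Stand-in transport for a dry run: pretends 2021-01-02 already has a DAG run.
calls: List[Tuple[str, Dict]] = []


def fake_post(path: str, body: Dict) -> int:
    calls.append((path, body))
    exists = path.endswith("/dagRuns") and body["logical_date"].startswith("2021-01-02")
    return 409 if exists else 200


created, cleared = backfill_via_api(
    fake_post, "example_dag", date(2021, 1, 1), date(2021, 1, 3)
)
print("created:", created)
print("cleared:", cleared)
```

Passing the transport in as a function keeps the loop testable without a running Airflow instance; swapping `fake_post` for a real HTTP call is the only change needed to point it at a server.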

However, I'd also argue the usefulness of that is limited, because you still need an active client the same way you do now. The only benefit is that you do not need the "airflow" package installed in the client and you do not need direct DB access. And if you do it only for backfill, it would be at most a tactical solution.

I'd say it would be much better (more future proof) to extend the airflow CLI so that it can do everything the current CLI does via the API, and to make a separate airflow-cli package that you could install independently from Airflow. That is something that partially worked in 1.10 (but it was rarely used and brittle) - the CLI could then use the experimental API for some operations and perform a small set of actions without DB access. It could be done incrementally, starting from backfill, but I think it's worth doing it with the "Remote airflow CLI" as a goal, not just backfill - then it makes sense I think and might be a very good "strategic" direction.

2) "passive" - you submit "BackfillJob"s via the API (and there are API calls that can check the progress). Then, in order to perform the backfill, you must have a component (could be either a modified scheduler or a separate component) that continuously runs, executes and monitors the backfills, and you also need a UI in the webserver to monitor, and possibly re-run, the Backfill Jobs. This is a much bigger effort that requires architectural changes in the way the scheduler operates, or - more likely - implementing another scheduler-like component that would manage and control such backfills. I believe (@ashb?) the current scheduler is so heavily optimized that it will be difficult to make it run and control such Backfill jobs, so having a separate component might make more sense.
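To make the "passive" model a bit more concrete, here is a toy sketch of the three pieces it implies: a persisted backfill record, a submit/status pair that API calls would map to, and a loop in a long-running component that advances the jobs. Every name here (`BackfillJobRecord`, `BackfillManager`, `tick`) is hypothetical, and an in-memory dict stands in for the database; it says nothing about how the scheduler would actually integrate this.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Iterable, List


class BackfillState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class BackfillJobRecord:
    """Hypothetical persisted row describing one submitted backfill."""
    dag_id: str
    pending: List[str]
    done: List[str] = field(default_factory=list)
    state: BackfillState = BackfillState.QUEUED


class BackfillManager:
    """Stand-in for the long-running component; a dict replaces the database."""

    def __init__(self, run_one: Callable[[str, str], bool]) -> None:
        self._run_one = run_one  # executes one dag run, returns True on success
        self._jobs: Dict[int, BackfillJobRecord] = {}

    def submit(self, dag_id: str, dates: Iterable[str]) -> int:
        """What a 'create backfill' API call would do: store the request."""
        job_id = len(self._jobs) + 1
        self._jobs[job_id] = BackfillJobRecord(dag_id, list(dates))
        return job_id

    def status(self, job_id: int) -> BackfillJobRecord:
        """What a 'check progress' API call would read."""
        return self._jobs[job_id]

    def tick(self) -> None:
        """One loop iteration: advance every unfinished job by a single date."""
        for job in self._jobs.values():
            if job.state in (BackfillState.SUCCESS, BackfillState.FAILED):
                continue
            if not job.pending:
                job.state = BackfillState.SUCCESS
                continue
            job.state = BackfillState.RUNNING
            ts = job.pending.pop(0)
            if self._run_one(job.dag_id, ts):
                job.done.append(ts)
                if not job.pending:
                    job.state = BackfillState.SUCCESS
            else:
                job.state = BackfillState.FAILED


manager = BackfillManager(run_one=lambda dag_id, ts: True)
job_id = manager.submit("example_dag", ["2021-01-01", "2021-01-02"])
manager.tick()
state_after_first_tick = manager.status(job_id).state
manager.tick()
state_after_second_tick = manager.status(job_id).state
print(state_after_first_tick.value, "->", state_after_second_tick.value)
```

The point of the sketch is the split of responsibilities: the client only submits and polls, while the state lives on the server side and survives the client disconnecting, which is what the "active" model cannot offer.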

We'd need DB modifications to keep the status of the backfills, and a UI to view and monitor them. This is the "ultimate" backfill solution that might make backfill a first-class citizen. But the effort required here is much bigger, and it has some connected components that will need to be updated (the Helm Chart for one, documentation on how to run and install Airflow, the Docker Compose quick start, etc.) - a similar set of changes to those required when we added the "triggerer" for Deferrable Tasks for the upcoming 2.2. But again - if we would like to discuss how to approach it, some proposal will have to be brought to the devlist so that others have a chance to take part in the discussion. Improving Backfill is one of those "important" but not "urgent" things, and any change in the approach, or changing the CLI to be able to use the API, needs to be raised there.