apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Airflow REST API: Add generic capability to retrieve required information for the specified entity. #29893

Open ankushpurwar opened 1 year ago

ankushpurwar commented 1 year ago

Description

The Airflow REST API should add a generic capability to retrieve only the required information instead of sending all of it. For example, I may want to retrieve DAG run details using the REST API: https://airflow.apache.org/api/v1/dags/{dag_id}/dagRuns/{dag_run_id} or fetch the list of DAGs using the REST API: https://airflow.apache.org/api/v1/dags

It always returns the full details. Often the caller is not interested in all of the information.

So I suggest adding a generic capability to retrieve only the needed information, similar to offset and limit. For example, if we pass fields = {dag_id, is_paused} as a query parameter when calling the https://airflow.apache.org/api/v1/dags API, it returns a JSON body that contains only the {dag_id, is_paused} fields.

The same applies to other endpoints as well (at least the GET ones).
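
A minimal sketch of the proposed behaviour, not Airflow's actual implementation: it assumes a hypothetical comma-separated `fields` query parameter and filters an already-serialized DAG dictionary down to the requested keys. The helper names `select_fields` and `fields_from_url` and the sample payload are illustrative only.

```python
from urllib.parse import parse_qs, urlsplit


def select_fields(payload: dict, fields: list) -> dict:
    """Return only the requested keys; an empty list means 'no filtering'."""
    if not fields:
        return dict(payload)
    unknown = [name for name in fields if name not in payload]
    if unknown:
        # Rejecting unknown names keeps typos from silently returning nothing.
        raise ValueError(f"unknown fields: {unknown}")
    return {name: payload[name] for name in fields}


def fields_from_url(url: str) -> list:
    """Parse the hypothetical comma-separated `fields` query parameter."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("fields", [""])[0]
    return [name.strip() for name in raw.split(",") if name.strip()]


# Sample payload standing in for one serialized DAG (assumed shape).
dag = {"dag_id": "example_dag", "is_paused": False, "description": None, "tags": []}
url = "https://airflow.apache.org/api/v1/dags?limit=10&fields=dag_id,is_paused"

partial = select_fields(dag, fields_from_url(url))
print(partial)
```

With the sample URL above, only `dag_id` and `is_paused` survive; a request without `fields` would return the full payload, which keeps existing callers unaffected.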

Use case/motivation

  1. Optimize the information we retrieve from the server.
  2. Save network bandwidth by reducing the response to the required information.
  3. Make it possible to collect more data in one go.

Related issues

Cannot say.

Are you willing to submit a PR?

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template!

hussein-awala commented 1 year ago

Sounds like a good feature, want to work on it and be an Airflow contributor?

ghost commented 1 year ago

Can I work on this issue?

hussein-awala commented 1 year ago

@zazemlenie Sure, I assigned it to you

maahir22 commented 1 year ago

Has there been any development on this? I would like to contribute if possible. Are we planning to integrate the functionality of fetching only specific fields for every GET endpoint? Won't there be an issue with the query string getting too long, or do we plan to impose limits on the granularity of fields that can be fetched? @hussein-awala @zazemlenie

ghost commented 1 year ago

I'm working on this issue. I haven't run into the query string issue you mentioned, but I'll look into it more closely.

maahir22 commented 1 year ago

Awesome, let me know if you need any help!

jackkolbert commented 10 months ago

Hi, I would like to contribute to this issue, could I be assigned it? Thank you

HarryWu99 commented 8 months ago

@hussein-awala @maahir22 Hello, I would like to contribute to this issue. Could I be assigned to it? I am new to Airflow, so could I get some help? I can locate airflow/api_connexion/endpoints/dag_endpoint.get_dags, but what calls this function? I saw that SQLAlchemySchema.dump is used directly as the return value. What is a good practice for extracting only the required fields?

potiuk commented 8 months ago

I assigned you - but part of the task is to propose how to do it. Generally speaking, generic retrieval/update of partial information is something that GraphQL attempted to do as the "next gen" API, attempting to "fix" what REST got wrong.

https://graphql.org/

However, my personal opinion (and that of many people) is that GraphQL is quite a bit TOO generic. It is relatively popular and used in quite a few places - but mostly in the "corporate" world and big installations because - unlike REST - it is not intuitive and the learning curve is, well, steep IMHO. I have never been thrilled with the idea of learning more about GraphQL and getting the hang of it personally. Also it tried to address all-but-kitchen-sink aspects of the API (including rate limiting, introspection, etc.). In most of the implementations it is very difficult to get performance right and there are plenty of other issues with it.

You can read for example here https://blog.logrocket.com/graphql-vs-rest-api-why-you-shouldnt-use-graphql/

IMHO (but this is my opinion) - we need something much simpler and more straightforward here, and rather than defining and following a "standard", we should possibly tap into other people doing similar things - because our API is described with an OpenAPI definition, and our REST endpoint documentation, Swagger UI and everything we have in the API is generated. That's especially important as our clients (notably https://github.com/apache/airflow-client-python) are generated using the OpenAPI client generator, which translates the OpenAPI specification into Python classes that you can import and use directly. This goes for other languages as well.

This is a bit tricky, because the generator produces the returned objects, so if the API returns partial objects, then it cannot return ACTUAL OBJECTS. It can return dictionaries, for example, or some proxy objects that actually contain only part of the data, where the rest of the data might be retrieved lazily.
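
To make the "proxy object" idea concrete, here is a rough sketch under stated assumptions: `PartialDag` and `fake_fetch` are hypothetical names, not part of the generated Airflow client. The proxy keeps the fields that were already returned and calls a caller-supplied function to load the full object only when a missing attribute is accessed.

```python
class PartialDag:
    """Hypothetical proxy: holds fetched fields, loads the rest on demand."""

    def __init__(self, dag_id, known, fetch_full):
        self._dag_id = dag_id
        self._data = dict(known)       # fields returned by the partial request
        self._fetch_full = fetch_full  # callable: dag_id -> full dict
        self._complete = False

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._data and not self._complete:
            # First access to a missing field triggers one full retrieval.
            self._data.update(self._fetch_full(self._dag_id))
            self._complete = True
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


calls = []


def fake_fetch(dag_id):
    """Stand-in for a real GET request; records that it was called."""
    calls.append(dag_id)
    return {"dag_id": dag_id, "is_paused": False, "description": "demo"}


dag = PartialDag("example_dag", {"dag_id": "example_dag", "is_paused": False}, fake_fetch)
print(dag.is_paused, len(calls))    # served from the partial data
print(dag.description, len(calls))  # triggers the lazy full fetch
```

The trade-off is that a lazy proxy hides network calls behind attribute access, which is convenient but can surprise callers; returning plain dictionaries avoids that at the cost of losing the typed objects.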

So we need to find a way to do it that:

a) is simple
b) builds on top of REST, not changing it to GraphQL
c) integrates nicely with the OpenAPI definition and Swagger
d) integrates with OpenAPI generators to allow such partial retrieval

So this task is really:

HarryWu99 commented 8 months ago

What does POC mean, please? You mentioned implementing a POC. @potiuk

potiuk commented 8 months ago

Proof Of Concept.

HarryWu99 commented 8 months ago

@potiuk Thank you for explaining the task in detail!🌸 But just for dags or dagRuns, isn't it OK to simply pass the 'only' parameter when the Schema() is created?

dag_schema = DAGSchema(only=fields)
return dag_schema.dump(dag)

And add nullable: true to the returned properties in airflow/api_connexion/openapi/v1.yaml.

I think it's hard to solve this task generally for now because the Swagger YAML files are not automatically generated from the schema. If they could be generated automatically, the YAML could also set nullable based on whether the schema field is required or not.
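
As a rough illustration of what the nullable suggestion implies on the client side, the sketch below uses a hypothetical `DagModel` dataclass (not the generated Airflow client class) in which every field is optional, so a partial response can still be turned into an object. This is an assumption about how a client could cope, not how the OpenAPI generator actually behaves.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DagModel:
    """Hypothetical client model: every field is nullable with a None default."""

    dag_id: Optional[str] = None
    is_paused: Optional[bool] = None
    description: Optional[str] = None

    @classmethod
    def from_partial(cls, payload: dict) -> "DagModel":
        # Ignore keys the model does not know; omitted fields stay None.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})


dag = DagModel.from_partial({"dag_id": "example_dag", "is_paused": True})
print(dag)
```

One limitation of this approach is that `None` becomes ambiguous: a field that was not requested looks the same as a field whose value is genuinely null on the server.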

potiuk commented 8 months ago

This issue is about generic functionality. If you want to do a limited version only for dags or dagRuns - feel free to open PRs with fixes - but they would not close this issue.