apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

openlineage: improve how sql utils parse table schemas #35552

Open JDarDagran opened 11 months ago

JDarDagran commented 11 months ago

Apache Airflow version

main (development)

What happened

For SQL-based operators, the airflow.providers.openlineage.utils.sql module is used by the SQLParser interface class. In short, it parses table schemas based on the input and output datasets parsed from the SQL query.

What you think should happen instead

It should check whether the database/schema configured on the connection is present in the information schema query result. If a matching table is found there, it should stop adding tables from other databases/schemas.

How to reproduce

The corner case is as follows:

  1. Use a database connection with a default database and/or schema set.
  2. Refer to the table by name only in the SQL query (e.g. SELECT * FROM my_table instead of SELECT * FROM my_schema.my_table).
  3. If a table with the same name exists in another database/schema (or database+schema combination, depending on the database), the OL integration produces two datasets for the table. For instance, with Postgres and the search path set to the public schema, SELECT * FROM my_table reads from public.my_table even if another table with the same name exists in a different schema. The OL integration nevertheless reports both my_schema.my_table and public.my_table.
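
To make the corner case concrete, here is a minimal, self-contained sketch of the current and the proposed behaviour. It does not use the real airflow.providers.openlineage.utils.sql code: ColumnRow, group_by_table and prefer_default_schema are hypothetical names made up for illustration, and the rows stand in for an information schema query result for an unqualified reference to my_table.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class ColumnRow(NamedTuple):
    """One row of a (hypothetical) information schema query result."""
    schema: str
    table: str
    column: str


TableKey = Tuple[str, str]  # (schema, table)


def group_by_table(rows: List[ColumnRow]) -> Dict[TableKey, List[str]]:
    """Naive grouping: every (schema, table) pair becomes its own dataset."""
    datasets: Dict[TableKey, List[str]] = {}
    for row in rows:
        datasets.setdefault((row.schema, row.table), []).append(row.column)
    return datasets


def prefer_default_schema(
    rows: List[ColumnRow], default_schema: Optional[str]
) -> Dict[TableKey, List[str]]:
    """Assumed fix: if a table name is found in the connection's default
    schema, drop same-named tables from every other schema."""
    datasets = group_by_table(rows)
    if default_schema is None:
        return datasets
    tables_in_default = {
        table for (schema, table) in datasets if schema == default_schema
    }
    return {
        (schema, table): columns
        for (schema, table), columns in datasets.items()
        if table not in tables_in_default or schema == default_schema
    }


# Rows as they might come back for "SELECT * FROM my_table".
rows = [
    ColumnRow("public", "my_table", "id"),
    ColumnRow("public", "my_table", "name"),
    ColumnRow("my_schema", "my_table", "id"),
]

naive = group_by_table(rows)
resolved = prefer_default_schema(rows, default_schema="public")

print(sorted(naive))   # both my_schema.my_table and public.my_table
print(resolved)        # only public.my_table
```

The sketch only covers table names that were referenced without a schema; a query that explicitly names my_schema.my_table would need to bypass this filter.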

Operating System

macOS

Versions of Apache Airflow Providers

apache-airflow-providers-openlineage==1.2.0

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 11 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

potiuk commented 11 months ago

Good ideas!