apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Add Apache XTable provider #40962

Closed gyli closed 2 months ago

gyli commented 2 months ago

Description

Apache XTable translates metadata among data lake table formats, allowing users to read from a data lake with tools that don't have native support for its format. XTable can be executed with a command like:

java -jar xtable-utilities/target/xtable-utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--convertersConfig converters.yaml] [--icebergCatalogConfig catalog.yaml]

An Airflow operator can be created to wrap this command and accept the XTable configs either as YAML files or as dicts.
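As a rough illustration of the "file or dict" input idea, here is a minimal sketch of a helper that assembles the command above. The function name `build_xtable_command`, its parameters, and the `workdir` handling are hypothetical and not part of XTable or Airflow. Dict configs are written out as JSON, which is also valid YAML, so the sketch needs no third-party YAML library. The Hadoop config is XML in the example command, so it is accepted as a path only.

```python
import json
import shlex
import tempfile
from pathlib import Path
from typing import Mapping, Optional, Union

# A config is either a path to an existing YAML file or an in-memory mapping.
ConfigInput = Union[str, Path, Mapping]

# Jar path copied from the example command; adjust to your own build.
DEFAULT_JAR = "xtable-utilities/target/xtable-utilities-0.1.0-SNAPSHOT-bundled.jar"


def _materialize(config: ConfigInput, workdir: Path, name: str) -> str:
    """Return a file path for `config`, writing mappings into `workdir` first."""
    if isinstance(config, Mapping):
        path = workdir / f"{name}.yaml"
        # JSON documents are also valid YAML, so json is enough here.
        path.write_text(json.dumps(dict(config), indent=2))
        return str(path)
    return str(config)


def build_xtable_command(
    dataset_config: ConfigInput,
    hadoop_config: Optional[Union[str, Path]] = None,
    converters_config: Optional[ConfigInput] = None,
    iceberg_catalog_config: Optional[ConfigInput] = None,
    jar: str = DEFAULT_JAR,
    workdir: Optional[Union[str, Path]] = None,
) -> str:
    """Build the shell command string (hypothetical helper, not an XTable API)."""
    base = Path(workdir) if workdir else Path(tempfile.mkdtemp(prefix="xtable_"))
    args = ["java", "-jar", jar,
            "--datasetConfig", _materialize(dataset_config, base, "dataset")]
    if hadoop_config is not None:
        # hdfs-site.xml is XML, so only a file path is supported here.
        args += ["--hadoopConfig", str(hadoop_config)]
    if converters_config is not None:
        args += ["--convertersConfig",
                 _materialize(converters_config, base, "converters")]
    if iceberg_catalog_config is not None:
        args += ["--icebergCatalogConfig",
                 _materialize(iceberg_catalog_config, base, "catalog")]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)
```

The returned string could then be passed as the `bash_command` of a `BashOperator`, or used as the container command for a `KubernetesPodOperator`.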

Use case/motivation

AWS provides an example XTableOperator for XTable. This blog has a good explanation of the open table formats XTable supports. However, this example operator is essentially an MVP and serves as an MWAA plugin. We can create an Apache XTable provider to make it available to more Airflow users and to allow more flexible user input.

Related issues

No response

raphaelauv commented 2 months ago

Apache Airflow is an orchestration tool, not a compute engine or ETL tool.

If you need to run custom code like this jar, trigger the run with a BashOperator or a KubernetesPodOperator.

And if you want a "friendly" XTable trigger operator, then extend the BashOperator or KubernetesPodOperator.

Example:

from airflow.operators.bash import BashOperator

class EasyXTableOperator(BashOperator):
    ...

gyli commented 2 months ago

Hi @raphaelauv, I agree that such an operator is a "friendly" XTable trigger and can be built on top of BashOperator. The example code you showed is exactly how I would build it, but my point here is that it should be an Airflow community-managed provider.

Just as Airflow offers an Iceberg hook and DatabricksSQLOperator in the corresponding providers, I believe XTable should also be added, as the data engineering industry is embracing unified data formats. For example, Microsoft Fabric and OneLake are adopting XTable. I understand that this Apache project is still in the incubating stage, and Airflow might want to hold off until it's closer to an industry standard. But from a data engineer's perspective, such a unified data lake format is where we are heading, and I don't see a reason to ask users to create a custom operator for such work.

raphaelauv commented 2 months ago

So you want to add an XTable hook?

gyli commented 2 months ago

At this stage I'm only thinking about building an XTableOperator, similar to the example operator that AWS provides for MWAA.

potiuk commented 2 months ago

Read https://github.com/apache/airflow/blob/main/PROVIDERS.rst#accepting-new-community-providers about the process for accepting new providers here, and feel free to follow it. Since this is not an issue or feature, I'm converting it into a discussion.