apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

Make XTable a community-managed Airflow provider #495

Open gyli opened 2 months ago

gyli commented 2 months ago

Feature Request / Improvement

Hi XTable maintainers,

I am planning to create an Airflow operator for XTable, and I would also like to make it a community-managed Airflow provider.

By Airflow operator, I mean something similar to what AWS presents in this blog: a wrapper around XTable's java command that lets users trigger it with config in Python code, as an Airflow task. I believe integrating XTable with Airflow would greatly help make it more popular and bring it closer to being an industry standard.
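For illustration, the core of such a wrapper might look like the sketch below. It assumes XTable is invoked as `java -jar <bundled jar> --datasetConfig <yaml>` with a config containing `sourceFormat`, `targetFormats`, and `datasets`, which should be checked against the current XTable docs. The jar path, table path, and all function names are hypothetical. It is kept free of Airflow imports so it runs on its own; an actual operator would call something like `run_xtable_sync` from its `execute()` method.

```python
import subprocess
import tempfile
from pathlib import Path


def render_dataset_config(source_format, target_formats, table_base_path, table_name):
    """Render a minimal XTable dataset config as YAML text (assumed layout)."""
    lines = [f"sourceFormat: {source_format}", "targetFormats:"]
    lines += [f"  - {fmt}" for fmt in target_formats]
    lines += [
        "datasets:",
        f"  - tableBasePath: {table_base_path}",
        f"    tableName: {table_name}",
    ]
    return "\n".join(lines) + "\n"


def build_sync_command(jar_path, config_path, java_bin="java"):
    """Build the argument list for the XTable java command (assumed CLI shape)."""
    return [java_bin, "-jar", str(jar_path), "--datasetConfig", str(config_path)]


def run_xtable_sync(jar_path, config_text, dry_run=False):
    """Write the config to a temp file and run the sync; return the command used.

    With dry_run=True the command is built but not executed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "dataset_config.yaml"
        config_path.write_text(config_text)
        cmd = build_sync_command(jar_path, config_path)
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which would fail the surrounding Airflow task.
            subprocess.run(cmd, check=True)
        return cmd


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    config = render_dataset_config(
        "HUDI", ["DELTA", "ICEBERG"], "s3://my-bucket/my_table", "my_table"
    )
    print(config)
    print(run_xtable_sync("/opt/xtable/xtable-utilities-bundled.jar", config, dry_run=True))
```

Passing the command as an argument list rather than a shell string avoids quoting problems with paths, and `dry_run` makes it possible to inspect what would be executed without a Java runtime installed.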

I've tried proposing this to Airflow directly, but adding it as a provider requires a vote. More importantly, they are also looking for support from XTable (or maybe even Onehouse?) directly, since they prefer a "mixed governance" approach. As an example, here is the discussion on the Airflow devlist about adding a new provider. Hence, I am requesting your support to bring this discussion to the table on both sides, provide more background and evidence of why XTable is helpful for data engineers (who are very likely Airflow users as well), and support such a vote on the Airflow devlist.

the-other-tim-brown commented 1 month ago

@gyli I like this idea, and it seems like a natural way for Airflow users to sync their tables after some other step has run in their Airflow pipelines. Onehouse does not own XTable, since it is an Apache Incubating project, so I don't know how that will work with the proposed mixed governance. I also lack the Airflow experience to know what it would mean to build and maintain an operator.

gyli commented 1 month ago

To provide more details and examples of Airflow providers:

  1. Here is an Airflow provider for Spark, which I think could be a good example since it's essentially also a wrapper of Spark commands.
  2. AWS has a demo of the XTable operator at https://github.com/aws-samples/apache-xtable-on-aws-samples, although it works as an MWAA plugin. The core operator logic should be very similar.
  3. Here is Airflow's doc about the process for adding a new provider.

gyli commented 1 month ago

Also, I can take on the implementation of the operator, but I would like to put it on hold until there is some progress in the discussion with the Airflow team. Can we bring some more attention to this and discuss it among the XTable maintainers as the first step?

vinothchandar commented 1 month ago

> I've tried proposing this with Airflow directly, while it requires votes to add it as a provider.

Thanks for bringing this up, @gyli. It's a great idea to have the conversion run at the end of Airflow DAGs.

Happy to help this make progress. Do you have a dev list thread or a GH issue on Airflow where you have brought this up with the Airflow maintainers? If so, the easiest path would be to chime in there and understand what needs to be done and the overall process.

vinothchandar commented 1 month ago

This looks like the process? https://github.com/apache/airflow/blob/main/PROVIDERS.rst#accepting-new-community-providers

gyli commented 1 month ago

The above doc is the correct process for adding a new provider.

I have started a discussion here, but they need an official proposal and a vote on the Airflow devlist.

vinothchandar commented 1 month ago

@gyli I was off. Will get on this next week. Thanks for your patience.

gyli commented 1 month ago

Awesome. I was about to send the email to their devlist, but it would be much better if you could send it. Thanks.