apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37.19k stars 14.32k forks source link

Support for Trino in Airflow #14281

Closed DRavikanth closed 3 years ago

DRavikanth commented 3 years ago

Description We currently are having PrestoHook that underneath uses presto-python-client. Now with Trino (formerly known as PrestoSQL) is renamed and there are a change of headers in Trino presto-python-client will no longer work and so PrestoHook provider breaks.

What do you want to happen? Ask is to provide support for TrinoHook which uses https://github.com/trinodb/trino-python-client underneath.

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 3 years ago

@DRavikanth - this is a community driven project, maybe you would like to work on this and submit PR?

potiuk commented 3 years ago

Especially if there will be no-one to pick it up quickly - this is the best way to speed it up. I am happy to help with reviewing and guiding you if needed @DRavikanth !

DRavikanth commented 3 years ago

@potiuk, I am happy to work but, I have no knowledge on this/airflow. I recently started to learn and evaluate Airflow for few of our use cases.

yes, It would be great if you can guide me along.

potiuk commented 3 years ago

yes, It would be great if you can guide me along.

I do not know much about presto - but when we join forces...

The gist of it is the Presto Operators to start using the client library you mentioned - so it has to be changed in setup.py - there is presto extra there and it has the dependencies there -> changing them to use Trino dependency is a good idea.

In fact we've already discussed renaming the provider to "trino" (we are aware of the name change) so a good start would be to find all the places where "presto" is mentioned in the code and create a new provider trino that would be just copy of presto and you could use the new dependency in the new provider instead of the old one.

That would be under "airflow/providers/presto" "tests/providers/presto", "docs/apache-airflow-providers-presto" and few other common places.

That would give a chance for Facebook to still maintain the presto provider if they want to. Then just switching imports to the new library should do the trick I believe :).

That woudl be a good start. There is some more work needed to make integration tests working (we have a docker compose to run presto database). But we can worry about it next (@mik-laj implemented even a kerberos integration there) - happy to guide you trough the process when we get there.

eladkal commented 3 years ago

There was a message about it in the mailing list : https://lists.apache.org/thread.html/r8c1f1c330113052ae0a27c55611f582dcf1aaae84cc8f0a0cc37d5a2%40%3Cdev.airflow.apache.org%3E

Assuming creating integration from scratch with https://github.com/trinodb/trino-python-client the scope is adding a new provider. A reference for adding a new provider can be found in https://github.com/apache/airflow/pull/13324 (You can start by adding only TrinoHook under the new Trino provider)

I'm not sure if it requires to discuss this again on the mailing list. It's a bit tricky as some of Trino users (like me) are currently using Presto provider and it might be difficult to notify about splitting the provider.

In any case I'd be happy to help with review as I'm running Trino and currently working with the existed PrestoHook (which uses prestodb package)

DRavikanth commented 3 years ago

Thanks for the tips @potiuk / @eladkal. I will check that and spend sometime to work on this.