bryzgaloff / airflow-clickhouse-plugin

The most popular ClickHouse plugin for Airflow. 🔝 Top-1% downloads on PyPI: https://pypi.org/project/airflow-clickhouse-plugin! Based on mymarilyn/clickhouse-driver.
MIT License
137 stars 36 forks source link

Contributing a ClickHouse Provider to Airflow #82

Open vargacypher opened 5 months ago

vargacypher commented 5 months ago

Hi everyone,

I'm starting to working on this task https://github.com/apache/airflow/issues/39140 in Airflow repository. The task involves creating a ClickHouse provider similar to the clickhouse-plugin.

First, I'd like to understand:

Is it feasible to develop a ClickHouse provider for Airflow? Why wasn't the clickhouse-plugin chosen as the default provider?

Second, while I'm primarily interested in creating a provider for my own (SSHClickhouseOperator) functionalities, I believe it could be beneficial for the entire ClickHouse community. I'd be happy to collaborate on a solution that integrates with the existing work done by u guys.

Please let me know if you're open to collaboration on this project.

bryzgaloff commented 5 months ago

Hi Guilherme,

Thank you for initiating this discussion!

Is it feasible to develop a ClickHouse provider for Airflow?

Developing a ClickHouse provider for Airflow appears feasible without specific obstacles. The distinction between a provider and a plugin seems to be primarily an interface concern.

The task involves creating a ClickHouse provider similar to the clickhouse-plugin. ... Second, while I'm primarily interested in creating a provider for my own (SSHClickhouseOperator) functionalities, I believe it could be beneficial for the entire ClickHouse community.

What's the purpose of creating a new provider then instead of relying on an existing project?

Speaking about the SSH part, looks like it might be useful for a variety of operators. Therefore, rather than developing a specific SSH-ClickHouse operator, I recommend creating an SSH mixin. It would introduce SSH functionality and could be potentially plugged into any operator.

By doing so, you not only fulfil your original request but also create a reusable component that can empower various operators and providers with SSH capabilities. This approach ensures efficiency, avoids redundancy, and maximizes code reuse across the project.

Why wasn't the clickhouse-plugin chosen as the default provider?

As per the Apache Airflow issue you've referenced, the maintainers mentioned that adding a provider entails some approval overhead. The contributors community of the plugin has not yet proposed it as a candidate for provider status within the Airflow community.

I'd be happy to collaborate on a solution that integrates with the existing work done by u guys.

I am open to collaboration! 🤝 Let's define the expected deliverables with minimal redundancy and work towards integrating your functionalities with the existing framework.

vargacypher commented 5 months ago

Fantastic!

What's the purpose of creating a new provider then, instead of relying on an existing project?

I think it's just for convenience, to have more community support, and to be included in the official repository list.

I recommend creating an SSH mixin.

It's a great ideia, i'll think in something in that way. But looking for Airflow examples of mixins i found that @potiuk metioned in https://github.com/apache/airflow/issues/22670#issuecomment-1085162099, that

Mixins and mutliple inheritance are hard and it's completely counter-productive to try to support it.

bryzgaloff commented 5 months ago

I think it's just for convenience, to have more community support, and to be included in the official repository list.

What are the benefits? This project has good rating, is discoverable, SEO-optimized, and actively used and downloaded. I see only a risk of distracting the community's focus if a similar solution is introduced.

If you insist on having the provider, your effort to convert this plugin into a provider will be appreciated. This will be more productive than creating a copying functionality. Could you please suggest the steps for the conversion? I may review them to make sure no of the known dependencies break and help to perform the transition.

It's a great ideia, i'll think in something in that way. But looking for Airflow examples of mixins i found that @potiuk metioned in apache/airflow#22670 (comment), that

Inside the plugin's/provider's implementation you will still implement the SSH functionality as a mixin. I just suggest to introduce a mixin-only part as the new library/plugin instead of copying the existing functionality and adding the SSH support.

Also, I have a plenty of successful experience designing useful mixins: feel free to involve me to a review once you deliver the mixin plugin and I may help you overcome potential issues. The fact that Airflow community had poor experience maintaining mixins does not mean it is a wrong path to follow, though caveats may appear: but this is true for any code which lays a foundation for a reusable library.