apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0
7.95k stars 1.8k forks

[Feature][sql] Data Transmission based on SQL #1981

Closed (xleoken closed 2 years ago)

xleoken commented 2 years ago

Description

There are many data transmission products, such as Apache Flume, Apache Sqoop, Alibaba DataX, and DTStack FlinkX, and more and more of them support creating data transmission tasks through SQL configuration. So I want to raise a topic: let SeaTunnel focus on SQL. We can get a lot of benefits from it, and it would be more in line with the project's goal of being a next-generation, high-performance, distributed, massive data integration framework.

SQL is a declarative query language that composes queries from relational operators such as selection, filter, and join in a very intuitive way. We can use catalog management to manage these SQL statements instead of maintaining API configuration.
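To make the idea concrete, here is a minimal sketch of a transfer job that is defined entirely by one SQL statement using selection, filter, and join. It uses Python's built-in `sqlite3` purely as a stand-in engine; the table names, the `JOB_SQL` text, and the `run_job` helper are all hypothetical and are not SeaTunnel or Flink APIs.

```python
import sqlite3

# Hypothetical job definition: the whole transfer is a single SQL statement.
# Table and column names are made up for illustration.
JOB_SQL = """
INSERT INTO sink_orders (order_id, customer_name, amount)
SELECT o.id, c.name, o.amount
FROM source_orders AS o
JOIN source_customers AS c ON c.id = o.customer_id
WHERE o.amount >= 100
"""


def run_job(conn, job_sql):
    """Execute one SQL-defined transfer and return the number of rows written."""
    cur = conn.execute(job_sql)
    conn.commit()
    return cur.rowcount


# In-memory database standing in for real sources and sinks (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE source_customers (id INTEGER, name TEXT);
CREATE TABLE sink_orders (order_id INTEGER, customer_name TEXT, amount REAL);
INSERT INTO source_orders VALUES (1, 10, 250.0), (2, 11, 40.0), (3, 10, 120.0);
INSERT INTO source_customers VALUES (10, 'alice'), (11, 'bob');
""")

rows_written = run_job(conn, JOB_SQL)
result = conn.execute(
    "SELECT order_id, customer_name, amount FROM sink_orders ORDER BY order_id"
).fetchall()
```

In a real system the source and sink tables would be registered through a catalog and backed by connectors, so the job author would only have to maintain the SQL text.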

So I suggest that we create a new branch that focuses on SQL, similar to the api-draft branch. Many features need to be developed quickly, such as CDC, breakpoint continuation, metrics, catalog management, and a web UI. The goal of the branch is data transmission based on SQL.

SeaTunnel 规划思考.pptx (attachment; the title translates to "SeaTunnel planning thoughts")

William-GuoWei commented 2 years ago

It is a great idea to do SQL-like transformation. But I didn't see how to implement it on Spark, Flink, DataFusion, etc.; I only saw the approach implemented on Flink. As far as I know, FlinkSQL does this very well, and the FlinkSQL API is quite different from SparkSQL. I don't know how to deal with that if we design the architecture based only on Flink and FlinkSQL. Perhaps a universal SQL API is needed before implementing SQL on Flink, and of course some custom APIs can be used in some way. On this idea, see https://www.getdbt.com/ for how universal SQL supports AWS, Google Cloud, etc. I hope that helps in some way.
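As a rough illustration of what a universal layer might look like, the sketch below describes a JDBC table once and renders it into engine-specific DDL. The `JdbcTable` class and the `render_*` functions are hypothetical names invented here, and the emitted statements only follow the general shape of the Flink JDBC connector (`WITH` options) and the Spark SQL JDBC data source (`USING jdbc OPTIONS`); treat them as an approximation, not as a tested integration.

```python
from dataclasses import dataclass


@dataclass
class JdbcTable:
    """Engine-neutral description of a JDBC table (hypothetical model)."""
    name: str
    columns: dict  # column name -> SQL type
    url: str
    table: str


def render_flink(t):
    # Flink-style DDL: explicit columns, connector options in a WITH clause.
    cols = ", ".join(f"{c} {typ}" for c, typ in t.columns.items())
    return (
        f"CREATE TABLE {t.name} ({cols}) WITH ("
        f"'connector' = 'jdbc', 'url' = '{t.url}', 'table-name' = '{t.table}')"
    )


def render_spark(t):
    # Spark-style DDL: schema comes from the JDBC source, options in OPTIONS.
    return (
        f"CREATE TEMPORARY VIEW {t.name} USING jdbc OPTIONS ("
        f"url '{t.url}', dbtable '{t.table}')"
    )


# One renderer per engine; supporting another engine means adding an entry.
RENDERERS = {"flink": render_flink, "spark": render_spark}


def render(engine, t):
    """Render the engine-neutral table for the chosen engine."""
    return RENDERERS[engine](t)


orders = JdbcTable(
    name="orders",
    columns={"id": "INT", "amount": "DOUBLE"},
    url="jdbc:mysql://localhost:3306/shop",
    table="orders",
)
flink_ddl = render("flink", orders)
spark_ddl = render("spark", orders)
```

The point of the sketch is only that the logical definition stays the same while each engine gets its own rendering, which is the part that would need real design work.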

xleoken commented 2 years ago

@William-GuoWei Thanks for sharing your thinking. This proposal is still at an initial stage. We can design a SeaTunnel SQL that can be adapted to FlinkSQL or SparkSQL, but from my side I think that is not important for now. We have spent a lot of time adapting to multiple engines, and progress has been slow and tortuous, so it is better to support a single engine for now. As for https://www.getdbt.com/, we can add more connectors to support AWS, Google Cloud, etc.

Here is a sketch of the architecture I drew. It seems we should spend more time designing the whole system rather than focusing on multiple engines.

image