apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[Feature][spark-transform-code] Transform via user-defined code #1901

Open smokeriu opened 2 years ago

smokeriu commented 2 years ago

Description

Allow users to upload a custom Java code file and call it like a UDF.

ruanwenjun commented 2 years ago

Sounds good.

Hisoka-X commented 2 years ago

Why upload a Java file rather than a jar? A jar contains all the dependencies you need. With a single Java file, that is hard to do.

Hisoka-X commented 2 years ago
  • I plan to use, for the time being, the CodeGenerator that Spark has implemented. It has been fully tested.

If we use this, maybe we should create a shaded dependency, because SeaTunnel core logic should not depend on engine code.

smokeriu commented 2 years ago

Why upload a Java file rather than a jar? A jar contains all the dependencies you need. With a single Java file, that is hard to do.

Sometimes it is difficult to do some work through SQL alone, but it becomes easier in Java. Sometimes the user just needs simple code to get the job done. In that case, I think uploading a jar increases the workload, because with Java code the user does not need to do packaging and so on. Of course, the disadvantage is that the user can only use the dependencies that already exist in our app. However, in the future we can provide a --jars entry, and then the user will be able to use other dependencies in the code.

smokeriu commented 2 years ago
  • I plan to use, for the time being, the CodeGenerator that Spark has implemented. It has been fully tested.

If we use this, maybe we should create a shaded dependency, because SeaTunnel core logic should not depend on engine code.

As I envision it now, it's just a Spark Transform, so we can start with the dependencies that Spark already has and use some of the methods/tools already implemented by Spark. A Flink or generic implementation may need more discussion, as I haven't worked with Flink before.
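
To make the "compile a Java file and call it like a UDF" idea concrete, here is a minimal sketch. It is not the proposed implementation: it uses the JDK's built-in `javax.tools` compiler instead of Spark's CodeGenerator, and the names `UserCodeUdfSketch`, `UpperCaseUdf`, and the `eval` method convention are assumptions made up for illustration.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of SeaTunnel or Spark.
final class UserCodeUdfSketch {

    /** Compiles one Java source file into a temp directory and loads the resulting class. */
    static Class<?> compileAndLoad(String className, String source) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler found; a JDK is required");
        }
        Path dir = Files.createTempDirectory("user-code");
        Path file = dir.resolve(className + ".java");
        Files.write(file, source.getBytes(StandardCharsets.UTF_8));

        // run() returns 0 on success; null streams fall back to System.in/out/err.
        int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
        if (status != 0) {
            throw new IllegalArgumentException("User code failed to compile: " + className);
        }
        // The loader is left open on purpose so the loaded class can resolve further classes later.
        URLClassLoader loader = new URLClassLoader(
                new URL[] {dir.toUri().toURL()}, UserCodeUdfSketch.class.getClassLoader());
        return Class.forName(className, true, loader);
    }

    /** Assumed convention: the user class exposes a public static String eval(String). */
    static String callUdf(Class<?> udfClass, String input) throws Exception {
        Method eval = udfClass.getMethod("eval", String.class);
        return (String) eval.invoke(null, input);
    }

    public static void main(String[] args) throws Exception {
        // Stands in for the contents of an uploaded code file.
        String source =
                "public class UpperCaseUdf {\n"
              + "    public static String eval(String value) {\n"
              + "        return value == null ? null : value.toUpperCase();\n"
              + "    }\n"
              + "}\n";
        Class<?> udf = compileAndLoad("UpperCaseUdf", source);
        System.out.println(callUdf(udf, "seatunnel"));
    }
}
```

The user code here can only reference classes already visible to the parent class loader, which matches the limitation discussed above: without something like a --jars entry, only dependencies already in the app are available.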

ruanwenjun commented 2 years ago

Why upload a Java file rather than a jar? A jar contains all the dependencies you need. With a single Java file, that is hard to do.

Sometimes it is difficult to do some work through SQL alone, but it becomes easier in Java. Sometimes the user just needs simple code to get the job done. In that case, I think uploading a jar increases the workload, because with Java code the user does not need to do packaging and so on. Of course, the disadvantage is that the user can only use the dependencies that already exist in our app. However, in the future we can provide a --jars entry, and then the user will be able to use other dependencies in the code.

If so, I would like to split the transforms out of our distribution, like source/sink. Users would only need to add seatunnel-api-xx to their new transform plugin and put the plugin into the transform directory, and SeaTunnel will load it automatically.
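
As a rough illustration of loading plugins from a directory, the sketch below scans a folder for jars and discovers implementations with the JDK's `ServiceLoader`. The `TransformPlugin` interface, the class name, and the use of `ServiceLoader` are assumptions for this example; SeaTunnel's actual plugin API and discovery mechanism may differ.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical loader, not SeaTunnel's real plugin discovery code.
final class TransformPluginLoaderSketch {

    /** Hypothetical plugin contract that a user jar would implement. */
    public interface TransformPlugin {
        String name();

        String apply(String row);
    }

    /** Collects every *.jar directly inside the plugin directory as a classpath URL. */
    static URL[] jarUrls(Path pluginDir) throws IOException {
        List<URL> urls = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginDir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        }
        return urls.toArray(new URL[0]);
    }

    /** Discovers implementations registered under META-INF/services inside those jars. */
    static List<TransformPlugin> loadPlugins(Path pluginDir) throws IOException {
        ClassLoader loader = new URLClassLoader(
                jarUrls(pluginDir), TransformPlugin.class.getClassLoader());
        List<TransformPlugin> plugins = new ArrayList<>();
        for (TransformPlugin plugin : ServiceLoader.load(TransformPlugin.class, loader)) {
            plugins.add(plugin);
        }
        return plugins;
    }

    public static void main(String[] args) throws IOException {
        // An empty temp directory stands in for the transform directory.
        Path dir = Files.createTempDirectory("transform-plugins");
        System.out.println("plugins found: " + loadPlugins(dir).size());
    }
}
```

With this approach the user still has to build a jar and register the implementation in a META-INF/services file, which is the packaging cost that the single-file proposal tries to avoid.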

smokeriu commented 2 years ago

If so, I would like to split the transforms out of our distribution, like source/sink. Users would only need to add seatunnel-api-xx to their new transform plugin and put the plugin into the transform directory, and SeaTunnel will load it automatically.

It is a good idea. Users can implement their own algorithms by extending BaseTransform, etc. But do you think this issue still needs to be implemented? The difference is that the user only needs a single code.java instead of packaging the algorithm. I think it will be more useful in test scenarios and simple scenarios.

Hisoka-X commented 2 years ago

Why upload a Java file rather than a jar? A jar contains all the dependencies you need. With a single Java file, that is hard to do.

Sometimes it is difficult to do some work through SQL alone, but it becomes easier in Java. Sometimes the user just needs simple code to get the job done. In that case, I think uploading a jar increases the workload, because with Java code the user does not need to do packaging and so on. Of course, the disadvantage is that the user can only use the dependencies that already exist in our app. However, in the future we can provide a --jars entry, and then the user will be able to use other dependencies in the code.

In my view, if users start writing code, they will use an IDE like IDEA or Eclipse, so packaging isn't a big problem. With only a Java file, the user would not know whether the code runs successfully before submitting it.
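
One partial answer to this concern would be a compile check before submission. The sketch below is only an assumption about how that could look: it compiles the source in memory with the JDK's `javax.tools` API and returns the compiler errors. `PreSubmitCheckSketch`, `StringSource`, and the sample class names are hypothetical.

```java
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.net.URI;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical pre-submit validator, not an existing SeaTunnel feature.
final class PreSubmitCheckSketch {

    /** Holds Java source text in memory so nothing has to be packaged first. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension),
                    Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns compiler error messages; an empty list means the source compiled. */
    static List<String> check(String className, String code) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler found; a JDK is required");
        }
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        // Class files go to a throwaway directory; only the diagnostics matter here.
        String outDir = Files.createTempDirectory("pre-submit").toString();
        compiler.getTask(null, null, diagnostics, Arrays.asList("-d", outDir), null,
                Collections.singletonList(new StringSource(className, code))).call();

        List<String> errors = new ArrayList<>();
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            if (d.getKind() == Diagnostic.Kind.ERROR) {
                errors.add("line " + d.getLineNumber() + ": " + d.getMessage(null));
            }
        }
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // The second field assigns a String to an int, so at least one error is reported.
        String broken = "public class Broken {\n    int x = \"oops\";\n}\n";
        for (String error : check("Broken", broken)) {
            System.out.println(error);
        }
    }
}
```

This only catches compile-time problems. It does not show that the code behaves correctly at runtime, so the point about testing in an IDE still stands.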