apache / dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform, agile in creating high-performance workflows with low code.
https://dolphinscheduler.apache.org/
Apache License 2.0

Task plugin research and solution proposal (任务插件化调研,出解决方案) #201

Closed boandai closed 4 years ago

Baoqi commented 5 years ago

I came across a new project today. Its code is relatively simple, it implements a basic plugin mechanism, and the code is quite clear. It is worth a look (it has far less code than StreamSets and is much simpler):

Reference: https://github.com/harbby/sylph - a stream computing platform for big data.

The plugin mechanism uses com.github.harbby.gadtry.classloader.DirClassLoader from https://github.com/harbby/gadtry to load plugins.

Plugin load code: https://github.com/harbby/sylph/blob/master/sylph-main/src/main/java/ideal/sylph/main/service/PipelinePluginLoader.java

Plugin implementation code reference: ClickHouseSink: https://github.com/harbby/sylph/blob/master/sylph-connectors/sylph-clickhouse/src/main/java/ideal/sylph/plugins/clickhouse/ClickHouseSink.java

- ClickHouseSink declares itself as a RealTimeSink through the Name and Description annotations.
- Its parameters are defined by ClickHouseSinkConfig, which extends PluginConfig: jdbcUrl, user, password, query, and bulkSize (bulkSize is an int, the others are strings); each parameter carries a Name and a Description.
- A single plugin jar can define multiple task types. Task types fall into three categories: Source, Transform, and Sink.
- NOTE: for EasyScheduler we should also consider i18n, so Name and Description should allow the plugin to be displayed in multiple languages.
- NOTE: EasyScheduler has many kinds of plugins, for example task types (SQL task, shell task, etc.) and JDBC connector plugins (MySQL, ClickHouse, etc.).
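As a minimal, self-contained illustration of this annotation-driven config pattern: the Name/Description annotations and the PluginConfig base class below are defined locally and only mimic sylph's API, so treat them as assumptions rather than the real signatures.

```java
// Self-contained sketch of the annotation-driven plugin-config pattern described above.
// The Name/Description annotations and PluginConfig base class are stand-ins, not sylph's real API.
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;

public class PluginConfigSketch {

    @Retention(RetentionPolicy.RUNTIME) @interface Name { String value(); }
    @Retention(RetentionPolicy.RUNTIME) @interface Description { String value(); }

    static class PluginConfig { }  // stand-in for sylph's PluginConfig base class

    @Name("ClickHouseSink")
    @Description("writes records to ClickHouse over JDBC")
    static class ClickHouseSinkConfig extends PluginConfig {
        @Name("url")      @Description("jdbc url")                    String jdbcUrl;
        @Name("userName") @Description("jdbc user")                   String user;
        @Name("password") @Description("jdbc password")               String password;
        @Name("query")    @Description("insert into ... values(...)") String query;
        @Name("bulkSize") @Description("rows to buffer before flush") int bulkSize = 20000;
    }

    public static void main(String[] args) {
        // A plugin loader or UI can read the annotations reflectively to build the config form.
        for (Field field : ClickHouseSinkConfig.class.getDeclaredFields()) {
            Name name = field.getAnnotation(Name.class);
            Description desc = field.getAnnotation(Description.class);
            if (name != null) {
                System.out.println(name.value() + " (" + field.getType().getSimpleName()
                        + "): " + (desc == null ? "" : desc.value()));
            }
        }
    }
}
```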

Front-end handling of the plugin parameters (I have not compiled or deployed it locally, so I could only skim the code). The code is at: https://github.com/harbby/sylph/blob/master/sylph-controller/src/main/webapp/app/js/etl.js#L27

It simply serializes the plugin's config into a JSON object, and the user then edits the specific parameter values in a text box.

- NOTE: for EasyScheduler, we should be able to provide different UI presentations depending on the config type.
- NOTE: for EasyScheduler, we should be able to group configs into different tab groups and put each config into its own tab, such as "basic information", "JDBC information", "advanced configuration", etc.



EricJoy2048 commented 5 years ago

Task plugin design

Custom tasks

A custom task should consist of three parts: configuration.xml, metainfo.xml, and an implementation class XXXTask based on AbstractTask.

1. XXXTask is an implementation class based on AbstractTask and implements the task-related interfaces. When the system executes the task, it actually instantiates this class via reflection and then calls its execution method.

2. The configuration.xml file mainly describes the custom configuration parameters of this task type. Note that every task also has some common parameters, such as the task name, whether it can be retried, whether to alert on failure, and which resource queue to use. These parameters are system-level and cannot be modified or redefined by a custom task.

3. The metainfo.xml file defines the core information of this task type.

Interface definition of AbstractTask
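The interface itself is not reproduced in this comment, so here is a minimal sketch of what AbstractTask could look like based only on what is described here (a run method and a getTaskVariables method); everything else is an assumption.

```java
// Minimal sketch of an AbstractTask base class, inferred only from this design comment:
// a run() method executed by the scheduler and a getTaskVariables() accessor that
// exposes the system-level and custom parameters. Everything else is an assumption.
import java.util.Map;

public abstract class AbstractTask {

    // merged view of the system-level parameters and the task's custom parameters
    private final Map<String, String> taskVariables;

    protected AbstractTask(Map<String, String> taskVariables) {
        this.taskVariables = taskVariables;
    }

    // invoked by the task executor after the class is instantiated via reflection
    public abstract void run() throws Exception;

    // lets the concrete task read its system and custom parameters inside run()
    public Map<String, String> getTaskVariables() {
        return taskVariables;
    }
}
```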

metainfo.xml definition

task_type_name : the name of the custom task, for example an MR task or a SPARK task. This name is shown in the list of selectable task types on the left side of the process definition page.

classpath : the path of the custom task's implementation class. The task executor uses this path to instantiate the concrete implementation class via reflection and then runs its run method.
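A possible shape for metainfo.xml, given the two fields described above (the element names and nesting are assumptions):

```xml
<!-- hypothetical metainfo.xml sketch; element names and values are illustrative -->
<metainfo>
    <task_type_name>SPARK</task_type_name>
    <classpath>com.example.task.SparkTask</classpath>
</metainfo>
```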

configuration.xml definition

- name : the parameter name
- value : the parameter's default value
- type : the parameter type; one of INPUT, INPUT_LIST, SELECT, TEXTAREA, KV, RADIO

- INPUT: the front end renders an INPUT parameter as an input form element.
- INPUT_LIST: the front end first renders one input and adds a "+" button; clicking "+" adds more inputs.
- KV: the front end renders two inputs, the key on the left and its value on the right; more key-value pairs can be added with the "+" button.
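A possible shape for configuration.xml, given the name/value/type fields described above (the element names, nesting, and example parameters are assumptions):

```xml
<!-- hypothetical configuration.xml sketch; element and parameter names are illustrative -->
<configuration>
    <property>
        <name>main_class</name>
        <value></value>
        <type>INPUT</type>
    </property>
    <property>
        <name>deploy_mode</name>
        <value>cluster</value>
        <type>SELECT</type>
    </property>
    <property>
        <name>custom_properties</name>
        <value></value>
        <type>KV</type>
    </property>
</configuration>
```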

Loading custom tasks

After a custom task has been developed, it is packaged and uploaded to the server. The system then loads the task, reading the relevant information from metainfo.xml and configuration.xml during loading and writing it into the "task type definition table".

The task type definition table is designed as follows:

| ID | Task type name | Task classpath | Custom task parameters |
| --- | --- | --- | --- |

Process definition module

Making tasks pluggable requires that, in the process definition module, the list of selectable task types is read from the "task type definition table". Then, when a task is dragged onto the design canvas, all parameters in its configuration panel other than the system default parameters should be rendered on the page according to the "custom task parameters". This way the front end does not need a dedicated configuration page for every custom task type.

Saving a process

When the user selects a task type on the page, fills in the parameters as needed, and saves the task configuration, the task node information must be saved into the "task node definition table". This table stores the task node's ID, the ID of the task type it corresponds to, the task node name, and the task node's configuration parameters (the names and values of the parameters on the page).

The task node definition table is designed as follows:

| ID | Task type ID | Task node name | Task parameters |
| --- | --- | --- | --- |

Process execution

When a process is executed, the task executor uses the "task node ID" to find the task's concrete definition in the "task node definition table". With the "task type ID" it looks up the classpath in the "task type definition table", then instantiates the class via reflection and executes its run method.

The AbstractTask class provides a getTaskVariables method that returns the task's system parameters and custom parameters. The run method can call it to obtain the parameter names and values.
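A rough sketch of the executor side described above, assuming the AbstractTask shape from the earlier sketch and a constructor that takes the merged parameters (both assumptions; only the reflection lookup, run, and getTaskVariables come from this design):

```java
// Hypothetical sketch of the executor side: resolve the classpath registered in the
// task type definition table, instantiate the task via reflection, and call run().
// The constructor shape and the way parameters are passed are assumptions.
import java.util.Map;

public class TaskExecutorSketch {

    public static void execute(String classpath, Map<String, String> taskVariables) throws Exception {
        // "classpath" is the fully qualified class name stored in the task type definition table
        Class<?> taskClass = Class.forName(classpath);

        // assumes the concrete task exposes a constructor taking the merged parameters
        AbstractTask task = (AbstractTask) taskClass
                .getConstructor(Map.class)
                .newInstance(taskVariables);

        // inside run() the task can call getTaskVariables() to read the parameter values
        task.run();
    }
}
```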

khadgarmage commented 5 years ago

@gaojun2048 +1, I agree with your design. A few additional points come to mind:

  1. For task plug-in development there are two angles to consider. One is letting users directly develop plug-ins for their own business needs, which makes usage more flexible; the other is letting open-source contributors develop third-party plug-ins, which would push DolphinScheduler forward, open up more ideas, and could even lead to a plug-in marketplace. So, on top of the custom task plug-in feature, we should support importing third-party plug-ins. After import they follow the same flow as custom task plug-ins, so the extra workload is small; it only adds an import step to the original design.
  2. Plug-in development should not depend on Java; a plug-in can be written in any language or be any executable program. All the scheduling platform does is pass parameters through transparently, execute, and schedule, so developers have more choices. For example, a shell script that takes a few parameters can be packaged as a plugin, or someone can write an executable in Go, define some input parameters, and turn it into a plugin (see the sketch after this list).
  3. System upgrade and migration are also worth noting: plug-ins must keep working after a migration or an upgrade.
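A minimal sketch of the "transparent parameter pass-through" idea from point 2, assuming the plugin simply declares an executable path and named parameters (all names and the argument convention below are hypothetical):

```java
// Hypothetical sketch of running a non-Java plugin (shell script, Go binary, ...):
// the scheduler passes the configured parameters through untouched as CLI arguments
// and uses the exit code to decide task success or failure.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ExecutablePluginRunner {

    public static int run(String executablePath, Map<String, String> params) throws Exception {
        List<String> command = new ArrayList<>();
        command.add(executablePath);
        // transparent pass-through, e.g. --date 2020-01-01
        params.forEach((name, value) -> {
            command.add("--" + name);
            command.add(value);
        });
        Process process = new ProcessBuilder(command)
                .inheritIO()       // forward stdout/stderr to the scheduler's task log
                .start();
        return process.waitFor(); // non-zero exit code means the task failed
    }

    public static void main(String[] args) throws Exception {
        int exitCode = run("/opt/plugins/clean_data.sh", Map.of("date", "2020-01-01"));
        System.out.println("task finished with exit code " + exitCode);
    }
}
```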


davidzollo commented 4 years ago

For this feature, please refer to https://github.com/apache/incubator-dolphinscheduler/issues/2869

davidzollo commented 4 years ago

I will close this issue; for deeper discussion please refer to #2869.