apache / dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform, agile at creating high-performance workflows with low code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Feature][Flink-CDC] Support FlinkCDC Task #16207

Closed: kissycn closed this issue 2 days ago

kissycn commented 4 days ago

Description

Apache DolphinScheduler is a distributed workflow task scheduling system that supports many types of data workflow tasks. It currently supports basic Flink tasks but cannot yet run Flink CDC tasks directly. Flink CDC is a library widely used for real-time data synchronization and change data capture (CDC), which is particularly important in modern data architectures. Flink CDC 3.0 already supports submitting CDC jobs in pipeline mode, which makes it a natural fit for a DolphinScheduler task plugin.

Use case

On the task submission page, in addition to the basic DolphinScheduler parameters, the user can submit a Flink CDC pipeline YAML definition file:

   source:
     type: mysql
     hostname: localhost
     port: 3306
     username: root
     password: 123456
     tables: app_db.\.*

   sink:
     type: doris
     fenodes: 127.0.0.1:8030
     username: root
     password: ""

   pipeline:
     name: Sync MySQL Database to Doris
     parallelism: 2

Related issues

No response

davidzollo commented 4 days ago

Would you like to implement this feature?

kissycn commented 4 days ago

@davidzollo
Yes, I will try to complete it. Please assign the task to me.

SbloodyS commented 4 days ago

Please provide full design details before you start coding. @kissycn

ruanwenjun commented 2 days ago

Strong -1 on this plugin.

If you only want to manage Flink jobs, Flink's native tooling is more convenient, and DS won't do it better than the Flink portal.

DS previously tried to support Flink/Spark streaming, but all of those plugins were demos that have since decayed; they cannot even run a hello world now and lack complete lifecycle management.

kissycn commented 2 days ago

Motivation: the motivation for developing this plugin in DS is simple. DS serves as the single entry point for all engines, so data synchronization, batch processing, and stream processing are all managed uniformly by DS.

Problem: a Flink CDC task is submitted like this:

bash ${FLINK_CDC_HOME}/bin/flink-cdc.sh mysql-to-doris.yaml

Currently, flink-cdc.sh does not support cancelling a task; cancellation requires Flink's API. So if this plugin is developed in DS, it would submit tasks through the flink-cdc.sh shell script and cancel them through the Flink API.
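
To make the cancellation path concrete, here is a minimal sketch of calling Flink's REST API, which exposes PATCH /jobs/<jobId>?mode=cancel on the JobManager. The endpoint URL (http://localhost:8081) and the class name are illustrative assumptions, not existing DS code:

   import java.net.URI;
   import java.net.http.HttpClient;
   import java.net.http.HttpRequest;
   import java.net.http.HttpResponse;

   // Sketch: cancel a running Flink job through the JobManager REST API.
   // Flink accepts PATCH /jobs/<jobId>?mode=cancel and answers 202 Accepted.
   public class FlinkJobCanceller {

       private final HttpClient client = HttpClient.newHttpClient();
       private final String restEndpoint; // e.g. "http://localhost:8081" (assumption)

       public FlinkJobCanceller(String restEndpoint) {
           this.restEndpoint = restEndpoint;
       }

       public boolean cancel(String jobId) throws Exception {
           HttpRequest request = HttpRequest.newBuilder()
                   .uri(URI.create(restEndpoint + "/jobs/" + jobId + "?mode=cancel"))
                   .method("PATCH", HttpRequest.BodyPublishers.noBody())
                   .build();
           HttpResponse<String> response =
                   client.send(request, HttpResponse.BodyHandlers.ofString());
           return response.statusCode() == 202;
       }
   }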

Solution: as @ruanwenjun said, lifecycle management of streaming tasks has been a problem. From the source code we know that after a Flink task is submitted, the shell process is held open indefinitely and the Flink task cannot be cancelled. I think the following may solve these problems (a sketch of step 1 follows the list; step 2 reuses the cancel call sketched above):

  1. After the Flink CDC task is submitted with bash ${FLINK_CDC_HOME}/bin/flink-cdc.sh mysql-to-doris.yaml, write Flink's JobId to the DB. Once the submission succeeds, the task can return instead of holding the process forever.
  2. To cancel the task, read the JobId from the DB and call Flink's API to complete the cancel operation.
  3. Other lifecycle management is out of scope for now; let's first discuss the feasibility of the solution above.
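
A minimal sketch of step 1, assuming the flink-cdc.sh CLI prints the 32-hex-character Flink JobID somewhere on stdout; the TaskJobIdStore interface is a hypothetical stand-in for whatever DS table would persist the mapping:

   import java.util.Optional;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   // Sketch of step 1: pull the Flink JobID out of the flink-cdc.sh output and
   // persist it, so the submitting process can exit instead of being held open.
   public class FlinkCdcSubmitHelper {

       // A Flink JobID is rendered as 32 lowercase hex characters.
       private static final Pattern JOB_ID = Pattern.compile("\\b[0-9a-f]{32}\\b");

       // Hypothetical persistence hook; in DS this would write to a task table.
       public interface TaskJobIdStore {
           void save(int taskInstanceId, String flinkJobId);
       }

       public static Optional<String> extractJobId(String cliOutput) {
           Matcher m = JOB_ID.matcher(cliOutput);
           return m.find() ? Optional.of(m.group()) : Optional.empty();
       }

       public static void recordSubmission(String cliOutput, int taskInstanceId,
                                           TaskJobIdStore store) {
           extractJobId(cliOutput).ifPresent(jobId -> store.save(taskInstanceId, jobId));
       }
   }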

kissycn commented 2 days ago

After comprehensive consideration, I have decided to submit jobs using the Flink REST API instead. I will close this issue for now, and we can discuss it again if there is new progress.
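
For reference, a hedged sketch of REST-based submission: Flink's REST API can run a jar that was already uploaded to the JobManager via POST /jars/<jarId>/run. Whether the Flink CDC dist jar can be driven this way, and its entry class (org.apache.flink.cdc.cli.CliFrontend below), are assumptions to be verified, not confirmed behavior:

   import java.net.URI;
   import java.net.http.HttpClient;
   import java.net.http.HttpRequest;
   import java.net.http.HttpResponse;

   // Sketch: run an already-uploaded jar via POST /jars/<jarId>/run.
   // Entry class and argument layout for the Flink CDC dist jar are
   // assumptions, not verified behavior.
   public class FlinkRestSubmitter {

       private final HttpClient client = HttpClient.newHttpClient();
       private final String restEndpoint;

       public FlinkRestSubmitter(String restEndpoint) {
           this.restEndpoint = restEndpoint;
       }

       public String run(String jarId, String pipelineYamlPath) throws Exception {
           // programArgsList avoids shell-style quoting; entryClass is assumed.
           String body = String.format(
                   "{\"entryClass\":\"org.apache.flink.cdc.cli.CliFrontend\","
                 + "\"programArgsList\":[\"%s\"]}", pipelineYamlPath);
           HttpRequest request = HttpRequest.newBuilder()
                   .uri(URI.create(restEndpoint + "/jars/" + jarId + "/run"))
                   .header("Content-Type", "application/json")
                   .POST(HttpRequest.BodyPublishers.ofString(body))
                   .build();
           HttpResponse<String> response =
                   client.send(request, HttpResponse.BodyHandlers.ofString());
           // On success Flink responds with {"jobid":"<32-hex JobID>"}.
           return response.body();
       }
   }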