[Feature] dss single point operation is modified to be multi-active

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

Problem Description

At present, the multiple Microservices deployed by DSS in each environment are single nodes. No matter service exceptions or host exceptions, there is a great risk of service unavailability, which affects the availability of the entire product. And during version upgrades, all services need to be stopped for 1-2 hours each time, which can also affect the user experience to a certain extent. Therefore, it is necessary to transform all Microservices of the DSS into a multi live mode to ensure that the DSS service is still available when an exception occurs at a node.

Description

Realize the multi active deployment of DSS to ensure that during the maintenance of a certain set of service machines, the services of other machines can be used as usual without affecting users and without their perception. Based on this, a complete multi activity deployment plan needs to be provided. If a certain service is abnormal during the publishing process, an error message will be returned indicating that the system has taken a nap. Please try again later.

Use case

No response

solutions

1. Overall design To transform DSS from only supporting single node deployment to supporting multi node multi activity deployment, the points to be considered include: data sharing and synchronization, data consistency, load balancing and failover, and service discovery and registration. The latter two can directly reuse the existing functions of Linkis. DSS needs to care about two parts: whether cache is involved in the process of service invocation, and whether cache is involved in each Microservices, To avoid data inconsistency; The second is the tasks executed in the service, such as executing workflow or node tasks, publishing workflow tasks, copying workflow or project tasks, workflow import and export tasks, etc., to prevent abnormal task states of nodes from being returned to users. 1.1 Technical Architecture

1.2 Business architecture

From the user's perspective, there is no perception of whether the backend service is a single node or multiple nodes, so the business architecture remains unchanged.

2. Module design

Since the service has been merged into two services in the Microservices merging, and there are no cache related calls between the two services, the cache problem does not need to be considered. Therefore, the focus is on the various tasks executed in a single service, because when a node has certain executing tasks, if the node encounters an exception at this time, it must provide feedback to the user that the task has failed through other nodes. Here, a regular inspection method is adopted to check the status of the task and save the status to the database for return to the user. This scheduled task is controlled through parameter configuration and is executed every 60 seconds by default.

2.1 Workflow Publishing Tasks 2.1.1 Open source workflow conversion Due to the fact that there is no publish operation in the open source version and only the DSS workflow is converted into a scheduling system workflow, it is necessary to save the task state in the OrchestratorConversionJob. As the existing code only saves the state of the job in the cache, the job state needs to be stored in the database. Here, dss orchestrator job_ info table is reused. The scheduled task at this location is CheckOrchestratorConversionJobTask, defined in the Orchestrator server module. convertjob

In the first step, if all the instances obtained are alive, you can return directly, otherwise save the instance information; The second step is to obtain information about tasks being executed or initialized from the dss_orchestrator_job_info table. The third step is to compare the instance information. If the instance of a task that is being executed does not exist on Eureka, then the status of these tasks needs to be updated to failed. Step 4 Update the task status information; Step 5: If a node is abnormal, you need to send an alarm message to the developer, including the information about failed tasks on the node.

It should be noted that in the ConvertOrchestration method of OrchestratorPluginServiceImpl, the current instance needs to be obtained through the Sender.getThisInstance method and saved to the table dss Orchestrator Job_Info At the same time, this table will save the information of the conversion workflow task, and then update the information of the conversion workflow task in the OrchestratorConversionJob.

The existing dss_orchestrator_job_info table is used. Changes to the table add two fields, instance_name and status, and change the updated_time field to update_time.

2.2 Open Source workflow Executes tasks The existing table dss_workflow_task is used here to write the instance information, and the timed task is CheckWorkflowExecuteTask, defined in the flow-execution-server module. The entire implementation process is similar to 2.1.1. The persist method in WorkflowPersistenceEngine saves instance information, while the change method updates workflow execution information.

2.3 Open source workflow copy task Using existing table dss_orchestrator_copy_info here, in which writing instance information, timing task for CheckOrchestratorCopyTask, defined in the framework-orchestrator-server module. The entire implementation process is similar to 2.1.1. The copyOrchestrator method in OrchestratorFrameworkServiceImpl saves instance information, while OrchestratorCopyJob updates workflow copy task information.

2.4 Determine whether the scheduled cleaning of CS tasks is supported in a multi active state.

3. Data structure/storage design (determine which field to follow and modify the initialization statement for the first installation)

3.1 Workflow Publishing Tasks 3.1.1 Add fields instance_name and status in table dss_orchestrator_job_info, and change updated_time to update_time

ALTER TABLE `dss_orchestrator_job_info` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task';
ALTER TABLE `dss_orchestrator_job_info` ADD `status` varchar(128) DEFAULT NULL COMMENT 'Transition Task Status';
ALTER TABLE `dss_orchestrator_job_info` ADD `error_msg` varchar(2048) DEFAULT NULL COMMENT 'Conversion task exception information';
ALTER TABLE `dss_orchestrator_job_info` CHANGE `updated_time` `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP;
ALTER TABLE `dss_orchestrator_job_info` MODIFY `job_id` varchar(64) DEFAULT NULL COMMENT 'task id';

3.2 New field instance_name in table dss_workflow_task

ALTER TABLE `dss_workflow_task` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task' AFTER `status`;

3.3 Add instance_name in table dss_orchestrator_copy_info

ALTER TABLE `dss_orchestrator_copy_info` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task' AFTER `status`;

3.4 DDL statements of related tables must be updated at the same time for the initial installation

Anything else

No response

Are you willing to submit a PR?

[X] Yes I am willing to submit a PR!

WeBankFinTech / DataSphereStudio