apache / dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform, agile at creating high-performance workflows with low code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Feature][Task instance] Automatic cleaning of database task instance table #16145

Open · pinkfloyds opened this issue 2 weeks ago

pinkfloyds commented 2 weeks ago

Description

After long-term use of DolphinScheduler, the task instance table accumulates a large amount of data, which can lead to slow queries. It would be beneficial to have a feature that automatically cleans up these records, with a precise whitelist/blacklist mechanism at the project/ProcessDefinition level. After simple configuration, it would clean up periodically (e.g., every few months). I would like to contribute this feature. One question: would it be more appropriate to integrate this as a system-level function of DS, or to configure it as a ProcessDefinition?

Use case

No response

Related issues

No response

EricGao888 commented 2 weeks ago

I think you need to provide a more detailed design. Configuring at the system level and at the workflow-definition level both have pros and cons; the question is too general for reviewers to answer without a detailed design.

EricGao888 commented 2 weeks ago

BTW, you need to submit a DSIP for any potential modifications to the DS core.

https://github.com/apache/dolphinscheduler/issues/14102

zhuxt2015 commented 2 weeks ago

I recommend using the MySQL Event Scheduler to automatically delete old task instances.
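
A minimal sketch of that approach, assuming the MySQL event scheduler is enabled (SET GLOBAL event_scheduler = ON;) and the default DolphinScheduler metadata schema; the event name, monthly schedule, and three-month retention below are illustrative assumptions:

# purge task instances older than 3 months, once a month
CREATE EVENT IF NOT EXISTS clean_t_ds_task_instance
ON SCHEDULE EVERY 1 MONTH
DO
  DELETE FROM t_ds_task_instance
  WHERE start_time < NOW() - INTERVAL 3 MONTH;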

davidzollo commented 2 weeks ago

good idea, please design a proposal first

pinkfloyds commented 2 weeks ago

Plan One: Configure Workflow:

# black list: delete instances whose project or workflow definition is listed
delete from t_ds_task_instance
where (project_code in ($project_code)
       or process_instance_id in (select id
                                  from t_ds_process_instance
                                  where process_definition_code in ($process_definition_code)))
  and start_time < NOW() - INTERVAL 3 MONTH;

# white list: keep instances whose project or workflow definition is listed;
# a row is deleted only when it matches neither list (note "and", not "or")
delete from t_ds_task_instance
where (project_code not in ($project_code)
       and process_instance_id not in (select id
                                       from t_ds_process_instance
                                       where process_definition_code in ($process_definition_code)))
  and start_time < NOW() - INTERVAL 3 MONTH;

Choose either the blacklist or the whitelist variant; both retain the last three months of data and can be adjusted as needed. Fill in the values for $project_code and $process_definition_code as comma-separated lists, and remove the corresponding condition if it is not needed. Set up an SQL task node to run the statement (the metadata database must first be registered in the datasource center), schedule the workflow to run on the 1st of every month (a sample expression follows below), then export it as a JSON file for download and reuse. User documentation would need to be written for the manual configuration: there are many manual steps, improper use may cause unexpected results, and there is a slight barrier to entry, but the development cycle is short (it is basically done).
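
For the monthly schedule, DS timing uses Quartz-style crontab expressions; a sketch that fires at 03:00 on the 1st of every month (the hour is an illustrative choice):

# seconds minutes hours day-of-month month day-of-week year
0 0 3 1 * ? *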

Plan Two: Integrate at the system level. Add a "Log Cleaning Policy" item to the menu and implement the same functionality in the UI. All users could use it directly, but the development cycle is longer.

pinkfloyds commented 1 week ago

How about this plan? Optimized Plan One:

After the workflow development is complete, add it to DS's database initialization script along with detailed documentation, so that all users can use it easily with simple configuration.

eelnomel commented 1 week ago

I used a Python script that connects to the metadata database to delete rows from the t_ds_task_instance table. This caused the workflow's scheduler to become invalid, and the workflow could not be taken offline. Have you tested this? My version is 3.0.1.

SbloodyS commented 6 days ago

> Plan One: Configure Workflow: … Plan Two: Integration at the System Level: … (quoted from pinkfloyds's comment above)

Simply deleting from these tables is not enough. There are already APIs for deleting workflow definitions and workflow instances. We could create a sample workflow that selects the data to be deleted according to certain filtering criteria and calls the deletion API to delete it. I think this is good practice in terms of both user intrusion and iterative updates.
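
A minimal sketch of the selection step, assuming the default metadata schema; the sample workflow would run a query like this and hand the resulting ids to the existing workflow-instance deletion API (the three-month cutoff is an illustrative assumption):

# candidate workflow instances to pass to the deletion API
select id
from t_ds_process_instance
where end_time < NOW() - INTERVAL 3 MONTH;
# consider also restricting to terminal states so running instances are never deleted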

pinkfloyds commented 5 days ago

> Plan One: Configure Workflow: … Plan Two: Integration at the System Level: … (quoted from the proposal above)
>
> Simply deleting these tables is not enough. … (quoted from SbloodyS's comment above)

I understand, but some data operations are involved. For example, the whitelist requires querying all workflow definition ids, and DS does not provide a corresponding API. Should we add a dedicated API or query directly with SQL? Both options seem a bit cumbersome; which is better? I also feel that implementing the above functionality with a Python script would be better than using a shell script.
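
For the direct-SQL route, a minimal sketch of the lookup, assuming the table and column names of current DS metadata schemas (verify against your version):

# list every workflow definition code together with its project
select p.name as project_name, d.code as process_definition_code, d.name as definition_name
from t_ds_process_definition d
join t_ds_project p on d.project_code = p.code;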