[DSIP-48][Cluster Task Insights] Add a series of monitoring indicators to reflect the running status of tasks

jiangtaoNuc commented 6 months ago

Search before asking

[X] I had searched in the DSIP and found no similar DSIP.

Motivation

At present, the monitoring items on the DS homepage are too simple to give clear insight into workflow and task status, either overall or per project, and they include no statistics on abnormal situations. This proposal adds analysis indicators to help administrators, data developers, and frontline operations staff analyze and adjust execution. There are two dimensions.

The first is overall scheduling analysis, aimed at cluster administrators. They need to know how many projects are currently scheduled, how many workflows are online, how many tasks are scheduled successfully each day, how scheduled tasks are distributed by hour, how many tasks succeed after a retry, and, at the task level, which tasks run for a long time or fail most often. The purpose of this dimension is to let cluster administrators quickly assess the operating status and task distribution of the scheduling system and give improvement suggestions to project developers.

The second is project analysis, aimed at the administrators of a particular project. Projects are usually organized with some logic, such as layering or independent operation per business scenario. Administrators need to see the project's workflow status, task status, and hourly scheduling distribution, and, at the task level, which tasks have longer running times and more failures.

Design Detail

The list of planned indicators is shown in the following figure. In the implementation, numeric indicators are presented as numeric cards, trends and proportions as line or bar charts, and lists as bar charts.

Compatibility, Deprecation, and Migration Plan

No response

Test Plan

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

jiangtaoNuc commented 6 months ago

The first image is the overall scheduling monitoring across all projects, and the second image is the monitoring of a single project. Some things to note:

Try to avoid processing data separately; summarize the results from the existing DS metadata tables at query time.
Some users have a large number of task instances in their production environment, so excessive metrics can lead to slow queries. Indicator calculations should avoid joining too many tables, and trend indicators, especially multi-day task instance statistics, need switches so that users can choose whether to enable them.
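
One way to read the switch suggestion above is as a configuration flag that short-circuits the expensive multi-day query. The sketch below is only an illustration; `InsightsConfig`, `trend_enabled`, `max_trend_days`, and `load_trend` are hypothetical names, not existing DolphinScheduler settings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class InsightsConfig:
    """Hypothetical switches for the proposed indicators."""
    trend_enabled: bool = False   # trend indicators stay off unless the user opts in
    max_trend_days: int = 7       # cap on the multi-day statistics window


def load_trend(config: InsightsConfig, days: int,
               query: Callable[[int], list]) -> Optional[list]:
    """Run the trend query only when the switch is on.

    `query` stands in for whatever reads the DS metadata tables.
    Returns None when trends are disabled so the caller can hide the chart.
    """
    if not config.trend_enabled:
        return None
    # Clamp the window so a large request cannot scan unbounded history.
    window = max(1, min(days, config.max_trend_days))
    return query(window)
```

Keeping the default at "off" means a deployment with millions of task instances pays nothing for the trend charts until someone deliberately enables them.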

XIJIU123 commented 5 months ago

numeric value

Number of projects

GET /firstPage/query-project-num

Parameters: none

Example response:

{
  "code": 0,
  "msg": "成功",
  "data": 25,
  "failed": false,
  "success": true
}

Total workflows, number of online workflows, number of lost workflows

GET /firstPage/query-process-num

Parameters: none

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": {
        "result": [
            {
                "proc_status": 0,
                "proc_count": 475
            },
            {
                "proc_status": 1,
                "proc_count": 599
            }
        ]
    },
    "failed": false,
    "success": true
}

Parameter description:

proc_status: 0 indicates online workflows, 1 indicates total workflows, and 2 indicates lost workflows

proc_count: the number of workflows
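
As a reading aid, the `result` array above can be turned into a lookup keyed by status. This is a minimal sketch that assumes the status codes listed in the parameter description; the label strings and the `summarize_process_counts` helper are illustrative only.

```python
import json

# Status codes as given in the parameter description above.
STATUS_LABELS = {0: "online", 1: "total", 2: "lost"}


def summarize_process_counts(body: str) -> dict:
    """Map each proc_status in the response to its proc_count."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise ValueError(f"request failed: code={payload.get('code')}")
    counts = {label: 0 for label in STATUS_LABELS.values()}
    for row in payload["data"]["result"]:
        label = STATUS_LABELS.get(row["proc_status"])
        if label is not None:          # ignore statuses this sketch does not know
            counts[label] = row["proc_count"]
    return counts


sample = """{"code": 0, "msg": "成功", "data": {"result": [
  {"proc_status": 0, "proc_count": 475},
  {"proc_status": 1, "proc_count": 599}]},
  "failed": false, "success": true}"""
print(summarize_process_counts(sample))
```

Statuses missing from the response, such as 2 in the sample, default to zero so a card always has a value to show.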

The number of online tasks

GET /firstPage/query-task-num

Parameters: none

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": 5756,
    "failed": false,
    "success": true
}

The number of scheduled tasks, the number of successfully scheduled tasks, and the number of tasks that were successfully scheduled yesterday

GET /firstPage/query-scheduler-num

Parameters: none

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": {
        "finishSchedulerNum": 8749,
        "yesterdaySchedulerNum": 8723,
        "totalSchedulerNum": 13638
    },
    "failed": false,
    "success": true
}

Parameter description:

finishSchedulerNum: the number of tasks scheduled successfully today

totalSchedulerNum: the number of tasks that should be scheduled

yesterdaySchedulerNum: the number of tasks scheduled successfully yesterday
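
These three counters are enough to derive the figures a numeric card would likely show. The sketch below computes today's completion ratio and the change against yesterday; the function name and the choice of derived values are assumptions, not part of the proposed API.

```python
def scheduler_summary(data: dict) -> dict:
    """Derive card values from the query-scheduler-num `data` object."""
    finished = data["finishSchedulerNum"]
    total = data["totalSchedulerNum"]
    yesterday = data["yesterdaySchedulerNum"]
    # Guard against a day with nothing scheduled.
    ratio = finished / total if total else 0.0
    return {
        "finished_ratio": round(ratio, 4),
        "not_finished": max(total - finished, 0),
        "delta_vs_yesterday": finished - yesterday,
    }


print(scheduler_summary({
    "finishSchedulerNum": 8749,
    "yesterdaySchedulerNum": 8723,
    "totalSchedulerNum": 13638,
}))
```

All of this is arithmetic on values the endpoint already returns, so it adds no extra database load.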

List (manifest)

Top 5 Tasks in Running Duration

GET /firstPage/query-timeouttask-top

Parameters:

startDate: (required, type: string, non-null) start time.

endDate: (required, type: string, non-null) end time.

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": [
        {
            "name": "dwi_breed_estrus_qs",
            "count": 0,
            "duration": 468
        }
    ],
    "failed": false,
    "success": true
}

Parameter description:

name: the name of the task

count: the number of executions

duration: time spent (minutes)

Top 5 Failed Tasks

GET /firstPage/query-failtask-top

Parameters:

startDate: (required, type: string, non-null) start time.

endDate: (required, type: string, non-null) end time.

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": [
        {
            "name": "dwi_breed_estrus_qs",
            "count": 0,
            "duration": 468
        }
    ],
    "failed": false,
    "success": true
}

Parameter description:

duration: time spent (minutes)

Trends (to be determined)

Task status trends

GET /firstPage/query-task-status-num

Parameters:

startDate: (required, type: string, non-null) start time.

endDate: (required, type: string, non-null) end time.

projectCode: (required, type: string, may be empty) project code.

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": {
        "x": [
            0,
            "...",
            23
        ],
        "y": [
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "成功"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "失败"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "停止"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "其他"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "全部"
            }
        ]
    },
    "failed": false,
    "success": true
}

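Parameter description:

x: x-axis coordinates (0 to 23 in the example)

y: y-axis series, one entry per status

name: the status label of the series (成功 = success, 失败 = failure, 停止 = stopped, 其他 = other, 全部 = all)

data: the values of the series, one per x-axis coordinate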
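
The `x`/`y` layout above is convenient for charting libraries but awkward to inspect. Below is a small sketch of flattening it into one row per coordinate and series, assuming every `data` array has the same length as `x`. The sample response elides its middle values with "...", so the input here uses made-up numbers.

```python
def flatten_trend(data: dict) -> list:
    """Turn {"x": [...], "y": [{"name", "data"}, ...]} into (x, name, value) rows."""
    rows = []
    for series in data["y"]:
        if len(series["data"]) != len(data["x"]):
            raise ValueError(f"series {series['name']!r} does not match the x axis")
        for x_value, value in zip(data["x"], series["data"]):
            rows.append((x_value, series["name"], value))
    return rows


# Illustrative values only; the names are series labels from the sample response.
example = {
    "x": [0, 1, 2],
    "y": [
        {"name": "成功", "data": [4, 0, 7]},
        {"name": "失败", "data": [1, 0, 0]},
    ],
}
for row in flatten_trend(example):
    print(row)
```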

ruanwenjun commented 4 months ago

Is there a detailed design? I still don't know how you calculate these metrics or why they are important. Is there any change related to the database? As described, will these metrics be displayed on the homepage? The current storage design is not suitable for OLAP; any OLAP query will put heavy pressure on the database.

SbloodyS commented 4 months ago

> Is there a detailed design? I still don't know how you calculate these metrics or why they are important. Is there any change related to the database? As described, will these metrics be displayed on the homepage? The current storage design is not suitable for OLAP; any OLAP query will put heavy pressure on the database.

+1

zhuxt2015 commented 4 months ago

We discussed the specific design with some members of the community today, which is summarized below

Put the numerical indicators on the home page, and the trend indicators on the monitoring module
The time field of task statistics uses the start_time field
For the trend indicator interface, the request should take time parameters, and the returned data should include a time field, a dimension field, and indicator values
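
To make the agreed response shape concrete, here is a minimal sketch that buckets task instances by the hour of their `start_time` and emits one record per time and dimension. The field names `time`, `dimension`, and `value`, the state strings, and the in-memory input are assumptions for illustration; the thread does not fix them.

```python
from collections import Counter
from datetime import datetime


def hourly_trend(instances: list) -> list:
    """Count task instances per (hour of start_time, state).

    `instances` is a list of (start_time, state) pairs, where start_time
    is a string such as "2024-07-15 09:30:00".
    """
    buckets = Counter()
    for start_time, state in instances:
        parsed = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
        buckets[(parsed.strftime("%Y-%m-%d %H:00"), state)] += 1
    return [
        {"time": hour, "dimension": state, "value": count}
        for (hour, state), count in sorted(buckets.items())
    ]


records = hourly_trend([
    ("2024-07-15 09:05:00", "success"),
    ("2024-07-15 09:40:00", "success"),
    ("2024-07-15 09:59:59", "failure"),
    ("2024-07-15 10:00:00", "success"),
])
for record in records:
    print(record)
```

A flat list of records like this leaves the grouping into chart series to the frontend, which is one reason to prefer it over a pre-shaped `x`/`y` payload.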

Gallardot commented 4 months ago

> Is there a detailed design? I still don't know how you calculate these metrics or why they are important. Is there any change related to the database? As described, will these metrics be displayed on the homepage? The current storage design is not suitable for OLAP; any OLAP query will put heavy pressure on the database.

+1

Gallardot commented 4 months ago

I will vote -1.

There are a plethora of metrics here that could potentially strain the database. Moreover, unifying the calculation standards for different metrics is quite challenging, as various users or companies have their own interpretations of these standards.

Why don't we implement some basic metrics through the current metric module? This approach is simpler, more flexible, and easier to expand. The collection of basic metrics is crucial for DolphinScheduler, as these metrics help us better understand the system's operational status.

Users can aggregate these metrics using Prometheus's query language, PromQL, for instance: count, sum, avg, max, min, percentile, topk, bottomk, etc. This allows users to implement system monitoring and alerts.
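
For instance, a top-5 query could be sent to the Prometheus HTTP API as sketched below. The metric name `ds_task_instance_total`, its `state` and `task_name` labels, and the server address are hypothetical; only the `/api/v1/query` endpoint and the `topk`, `sum by`, and `increase` PromQL constructs are standard Prometheus.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumed address of a Prometheus server


def top_failed_tasks_query(k: int = 5, window: str = "1d") -> str:
    """Build PromQL for the k tasks with the most failures in the window.

    `ds_task_instance_total` and its labels are placeholders, not real
    DolphinScheduler metric names.
    """
    return (
        f"topk({k}, sum by (task_name) "
        f'(increase(ds_task_instance_total{{state="failure"}}[{window}])))'
    )


def instant_query_url(promql: str) -> str:
    """URL for Prometheus's instant-query endpoint."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


print(top_failed_tasks_query())
print(instant_query_url(top_failed_tasks_query()))
```

With this approach the aggregation runs in Prometheus rather than in the scheduler's metadata database.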

Data metric visualization can be achieved through Grafana. DolphinScheduler can support embedding a Grafana dashboard through iframe on the homepage to display monitoring data.

SbloodyS commented 4 months ago

> We discussed the specific design with some members of the community today, which is summarized below
>
> Put the numerical indicators on the home page, and the trend indicators on the monitoring module
>
> The time field of task statistics uses the start_time field
>
> For the trend indicator interface, the request should take time parameters, and the returned data should include a time field, a dimension field, and indicator values

I'm +1 on adding operational metrics. This is a great help in enhancing our observability. But in the whole description, I don't see a description of the implementation architecture.

If the way to achieve this is to run aggregate SQL statistics in the database to get these indicators, that implementation is not acceptable. It can have a devastating effect on the database load and directly affect scheduling stability. I'm strongly -1 on this approach.

My suggestion is to use Prometheus for metrics, Grafana for presentation, and have DS embed Grafana pages in the frontend to keep the UI unified. This is very low intrusion for DS, while also taking performance and scalability into account. For new indicators in the future, only the Grafana dashboard needs to be modified, without many changes to DS.

sdhzwc commented 4 months ago

I think a lot of times, users don't want to add extra Prometheus, they just want to see what's going on with the system with what's already there. So I think an on/off button could be added, leaving it up to the user whether to turn it on or not.


SbloodyS commented 4 months ago

At present, the main task of the community is to build a stable, scalable and high-performance scheduling system. To achieve this goal, boundaries need to be set for new functionality. Prometheus is the most popular monitoring solution in the industry today. This is also a feature that most users expect.

However, catering to the few users who do not want to use Prometheus by running statistical SQL queries on the database is neither robust nor scalable, and may irreparably affect the stability of the core scheduling. This is not a feature the community currently expects.

XIJIU123 commented 4 months ago

SQL performance test

Test database resources: 2 CPU cores, 4 GB memory

Test data volume:
1. Projects: 28
2. Users: 35
3. Process definitions: 1000
4. Task definitions: 5800
5. Task instances: 12800

type	interface	approximate average time	sql	remarks
numeric value	Number of projects	100ms	SELECT COUNT(*) from ( select distinct project_id from t_ds_project p,t_ds_relation_project_user rel where p.id = rel.project_id and rel.user_id= 2 UNION ALL select distinct id from t_ds_project where user_id= 2 ) result;
numeric value	Total workflows, number of online workflows	100ms	select release_state as proc_status,count(*) as proc_count from t_ds_process_definition group by release_state;
numeric value	The number of online tasks	100ms	select count(distinct b.post_task_code) from (select user_id,project_id from t_ds_relation_project_user where user_id=2 group by user_id,project_id)a join (select id,code from t_ds_project)c on a.project_id=c.id join (select project_code,post_task_code from t_ds_process_task_relation)b on c.code = b.project_code;
numeric value	The number of scheduled tasks, the number of successfully scheduled tasks, and the number of tasks that were successfully scheduled yesterday	3.3s	select count(*) from t_ds_process_instance instance, t_ds_process_definition define, t_ds_task_instance tins, t_ds_project project where instance.schedule_time is not null and instance.process_definition_code = define.code and tins.process_instance_id = instance.id and project.code = define.project_code and instance.schedule_time > '2024-07-15 00:00:00' and instance.schedule_time < '2024-07-16 00:00:00';	It takes about 300ms to query the number of scheduled tasks only, and about 3s to calculate the number of tasks that should be scheduled.
manifest	Top 5 Tasks in Running Duration	200ms	select name, duration from ( select a.process_definition_code, AVG(timestampdiff(MINUTE,a.start_time,a.end_time)) duration from t_ds_process_instance a,t_ds_process_definition b, t_ds_project c where a.schedule_time is not null and a.process_definition_code = b.code and c.code = b.project_code and a.start_time >='2024-07-15 00:00:00' and start_time <='2024-07-16 00:00:00' group by a.process_definition_code order by duration desc, process_definition_code asc limit 5 ) tmp left join t_ds_process_definition c on c.code = tmp.process_definition_code
manifest	Top 5 Failed Tasks	200ms	select c.name, tmp.count from ( select a.process_definition_code, count(*) count from t_ds_process_instance a, t_ds_process_definition b, t_ds_project c where a.schedule_time is not null and a.process_definition_code = b.code and c.code = b.project_code and a.start_time >='2024-07-15 00:00:00' and start_time <='2024-07-16 00:00:00' group by a.process_definition_code order by count desc, process_definition_code asc limit 5 ) tmp left join t_ds_process_definition c on tmp.process_definition_code = c.code;
Trends	Task status trends	200ms	select name, hh, sum(`sum`) as `value` from ( select n.id, case state when 1 then '正在运行' when 5 then '停止' when 6 then '失败' when 7 then '成功' else '其他' end as name, m.* from ( select j.project_code, j.hh, j.state, sum(j.cnt) as sum from ( select b.project_code, a.state, a.hh, a.cnt from ( select task_code, state, hh, count(*) cnt from( select task_code, state, hour(DATE_ADD(start_time,INTERVAL 14 HOUR)) hh from t_ds_task_instance where start_time >='2024-07-15 00:00:00' and start_time <='2024-07-16 00:00:00' ) k group by k.task_code, k.state, k.hh ) a left join t_ds_task_definition b on a.task_code = b.code ) j where j.project_code is not null group by j.project_code, j.state, j.hh ) m left join t_ds_project n on m.project_code = n.code ) p group by p.hh, p.name order by p.hh;	A day's worth of statistics by hourly dimension.
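
The timings above are approximate averages. A sketch of how such an average could be measured is shown below. It uses an in-memory SQLite database with a cut-down, made-up `t_ds_task_instance` table purely so the example runs anywhere, so the number it prints says nothing about the performance of the real DolphinScheduler database.

```python
import sqlite3
import time


def average_query_ms(conn: sqlite3.Connection, sql: str, runs: int = 5) -> float:
    """Run `sql` several times and return the mean wall-clock time in milliseconds."""
    elapsed = 0.0
    for _ in range(runs):
        started = time.perf_counter()
        conn.execute(sql).fetchall()   # fetch everything so the query fully executes
        elapsed += time.perf_counter() - started
    return elapsed / runs * 1000


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real table: only the columns this query needs.
conn.execute(
    "CREATE TABLE t_ds_task_instance "
    "(id INTEGER PRIMARY KEY, state INTEGER, start_time TEXT)"
)
# 12800 rows to mirror the test data volume; state 7 = success, 6 = failure,
# following the CASE expression in the trend query above.
conn.executemany(
    "INSERT INTO t_ds_task_instance (state, start_time) VALUES (?, ?)",
    [(7 if i % 10 else 6, f"2024-07-15 {i % 24:02d}:00:00") for i in range(12800)],
)

HOURLY_STATUS_SQL = """
SELECT strftime('%H', start_time) AS hh, state, COUNT(*) AS cnt
FROM t_ds_task_instance
WHERE start_time >= '2024-07-15 00:00:00' AND start_time <= '2024-07-16 00:00:00'
GROUP BY hh, state
ORDER BY hh, state
"""
print(f"{average_query_ms(conn, HOURLY_STATUS_SQL):.2f} ms")
```

Averaging over several runs smooths out cache warm-up. A production-sized test would also need realistic row counts, since the concerns raised in this thread are about installations with far more task instances than 12800.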

apache / dolphinscheduler