Open jiangtaoNuc opened 6 months ago
The first image shows overall scheduling and monitoring across all projects, and the second shows monitoring for an individual project. Some things to note:
Number of projects
GET /firstPage/query-project-num
Parameters: none
Return value case:
{
  "code": 0,
  "msg": "成功",
  "data": 25,
  "failed": false,
  "success": true
}
Total workflows, number of online workflows, number of lost workflows
GET /firstPage/query-process-num
Parameters: none
Return value case:
{
  "code": 0,
  "msg": "成功",
  "data": {
    "result": [
      {
        "proc_status": 0,
        "proc_count": 475
      },
      {
        "proc_status": 1,
        "proc_count": 599
      }
    ]
  },
  "failed": false,
  "success": true
}
Parameter description:
proc_status: 0 indicates online workflows, 1 indicates the total number of workflows, and 2 indicates lost workflows
proc_count: the number of workflows
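For illustration only (this helper is not part of DolphinScheduler), a minimal sketch of turning the query-process-num response above into named counts, using the proc_status mapping just described:

```python
# Hypothetical helper: map proc_status codes from the
# /firstPage/query-process-num response to readable names.
STATUS_NAMES = {0: "online", 1: "total", 2: "lost"}

def summarize_process_counts(response: dict) -> dict:
    """Return {status name: workflow count} from the response payload."""
    return {
        STATUS_NAMES.get(row["proc_status"], "unknown"): row["proc_count"]
        for row in response["data"]["result"]
    }

response = {
    "code": 0,
    "data": {"result": [{"proc_status": 0, "proc_count": 475},
                        {"proc_status": 1, "proc_count": 599}]},
}
print(summarize_process_counts(response))  # {'online': 475, 'total': 599}
```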
The number of online tasks
GET /firstPage/query-task-num
Parameters: none
Return value case:
{
  "code": 0,
  "msg": "成功",
  "data": 5756,
  "failed": false,
  "success": true
}
The number of scheduled tasks, the number of successfully scheduled tasks, and the number of tasks that were successfully scheduled yesterday
GET /firstPage/query-scheduler-num
Parameters: none
Return value case:
{
  "code": 0,
  "msg": "成功",
  "data": {
    "finishSchedulerNum": 8749,
    "yesterdaySchedulerNum": 8723,
    "totalSchedulerNum": 13638
  },
  "failed": false,
  "success": true
}
Parameter description:
finishSchedulerNum: the number of successful dispatches today
totalSchedulerNum: the total number of tasks that should be scheduled
yesterdaySchedulerNum: the number of successfully scheduled tasks yesterday
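As an illustration only (field names and values taken from the sample response above), a client could derive a daily success rate from this payload:

```python
# Derive a success rate from the /firstPage/query-scheduler-num payload.
# Values are the sample figures from the response above.
data = {
    "finishSchedulerNum": 8749,    # today's successful dispatches
    "yesterdaySchedulerNum": 8723,
    "totalSchedulerNum": 13638,    # tasks that should be scheduled
}

success_rate = data["finishSchedulerNum"] / data["totalSchedulerNum"]
print(f"{success_rate:.1%}")  # 64.2%
```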
Top 5 Tasks in Running Duration
GET /firstPage/query-timeouttask-top
Parameter:
startDate (required, string, non-null): start time.
endDate (required, string, non-null): end time.
Return value case:
{
  "code": 0,
  "msg": "成功",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
Parameter description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)
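A sketch of how a client might build this request; the date format is an assumption based on the SQL samples later in this thread ('YYYY-MM-DD HH:MM:SS'):

```python
from urllib.parse import urlencode

# Build the query string for /firstPage/query-timeouttask-top.
# The date format here is assumed from the SQL examples in this thread.
params = {
    "startDate": "2024-07-15 00:00:00",
    "endDate": "2024-07-16 00:00:00",
}
url = "/firstPage/query-timeouttask-top?" + urlencode(params)
print(url)
# /firstPage/query-timeouttask-top?startDate=2024-07-15+00%3A00%3A00&endDate=2024-07-16+00%3A00%3A00
```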
Top 5 Failed Tasks
GET /firstPage/query-failtask-top
Parameter:
startDate (required, string, non-null): start time.
endDate (required, string, non-null): end time.
Return value case:
{
  "code": 0,
  "msg": "成功",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
Parameter description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)
Task status trends
GET /firstPage/query-task-status-num
Parameter:
startDate (required, string, non-null): start time.
endDate (required, string, non-null): end time.
projectCode (required, string, may be empty): project code.
Return value case:
{
  "code": 0,
  "msg": "成功",
  "data": {
    "x": [0, "...", 23],
    "y": [
      {
        "data": [0, "...", 0],
        "name": "成功"
      },
      {
        "data": [0, "...", 0],
        "name": "失败"
      },
      {
        "data": [0, "...", 0],
        "name": "停止"
      },
      {
        "data": [0, "...", 0],
        "name": "其他"
      },
      {
        "data": [0, "...", 0],
        "name": "全部"
      }
    ]
  },
  "failed": false,
  "success": true
}
Parameter description:
x: x-axis values (hours of the day, 0-23)
y: y-axis series
data: data values for each series
name: task state type (成功 = success, 失败 = failure, 停止 = stopped, 其他 = other, 全部 = all)
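To make the x/y structure concrete, here is a hypothetical sketch (not DolphinScheduler code) of assembling hourly rows into the response shape shown above:

```python
from collections import defaultdict

def build_trend(rows, states):
    """Turn (hour, state, count) rows into the {"x": ..., "y": ...} shape.

    rows: iterable of (hour 0-23, state name, count) tuples.
    states: ordered list of state names to emit as series.
    """
    hours = list(range(24))
    counts = defaultdict(lambda: [0] * 24)
    for hour, state, cnt in rows:
        counts[state][hour] += cnt
    return {
        "x": hours,
        "y": [{"name": s, "data": counts[s]} for s in states],
    }

# Sample rows, invented for illustration.
rows = [(0, "成功", 3), (0, "失败", 1), (23, "成功", 2)]
trend = build_trend(rows, ["成功", "失败"])
print(trend["y"][0]["data"][0], trend["y"][0]["data"][23])  # 3 2
```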
Is there any detailed design? I still don't know how you calculate these metrics, or why they are important. Are there any changes to the database? As described, these metrics will be displayed on the homepage? The current storage design is not suitable for OLAP; any OLAP query will put heavy pressure on the database.
+1
+1
I will vote -1.
There are a plethora of metrics here that could potentially strain the database. Moreover, unifying the calculation standards for different metrics is quite challenging, as various users or companies have their own interpretations of these standards.
Why don't we implement some basic metrics through the current metric module? This approach is simpler, more flexible, and easier to expand. The collection of basic metrics is crucial for DolphinScheduler, as these metrics help us better understand the system's operational status.
Users can aggregate these metrics using Prometheus's query language, PromQL, for instance: count, sum, avg, max, min, percentile, topk, bottomk, etc. This allows users to implement system monitoring and alerts.
Data metric visualization can be achieved through Grafana. DolphinScheduler can support embedding a Grafana dashboard through iframe on the homepage to display monitoring data.
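As a sketch of the aggregation idea above, a dashboard or script could send a PromQL expression to Prometheus's HTTP API; the metric name `ds_task_success_total` below is made up for illustration, since the actual names depend on what the DS metrics module exposes:

```python
from urllib.parse import urlencode

# Construct a Prometheus HTTP API query URL for a topk aggregation.
# "ds_task_success_total" is a hypothetical metric name; the Prometheus
# host/port are also assumptions.
promql = "topk(5, sum by (task_name) (increase(ds_task_success_total[1d])))"
url = "http://prometheus:9090/api/v1/query?" + urlencode({"query": promql})
print(url)
```

A GET request to this URL would return the top 5 tasks by successful runs over the last day, which Grafana could then chart.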
We discussed the specific design with some members of the community today, which is summarized below
- Put the numerical indicators on the home page, and the trend indicators on the monitoring module
- The time field of task statistics uses the start_time field
- For trend-indicator interfaces, the request should take time parameters, and the returned data should include a time field, dimension fields, and indicator values
I'm +1 on adding operational metrics; this is a great help in enhancing our observability. But the description doesn't cover the implementation architecture.
If the implementation uses SQL aggregate queries against the database to compute these indicators, that is not acceptable. It could have a devastating effect on database load and directly affect scheduling stability. I'm strongly -1 on that approach.
My suggestion is to use Prometheus for metrics, Grafana for presentation, and have DS embed Grafana pages in the frontend for a unified experience. This is minimally intrusive for DS while also accounting for performance and scalability. For new indicators in the future, only the Grafana dashboard needs to change; DS itself needs few modifications.
I think that in many cases users don't want to add Prometheus on top; they just want to observe the system with what they already have. So I think we could add a toggle switch and leave it to users whether to enable it.
At present, the main task of the community is to build a stable, scalable, and high-performance scheduling system. To achieve this goal, boundaries need to be set for new functionality. Prometheus is the most popular monitoring solution in the industry today, and this is also the feature most users expect.
However, for the few users who do not want to use Prometheus, running statistical SQL queries directly against the database is neither robust nor scalable, and may irreparably affect the stability of core scheduling. This is not a feature the community currently expects.
Test database resources: 2C, 4G

Test database data volume:
1. project num: 28
2. user num: 35
3. process definition num: 1000
4. task definition num: 5800
5. task instance num: 12800

| type | interface | approximate average time | sql | remark |
|---|---|---|---|---|
| numeric value | Number of projects | 100ms | `SELECT COUNT(*) from ( select distinct project_id from t_ds_project p, t_ds_relation_project_user rel where p.id = rel.project_id and rel.user_id = 2 UNION ALL select distinct id from t_ds_project where user_id = 2 ) result;` | |
| numeric value | Total workflows, number of online workflows | 100ms | `select release_state as proc_status, count(*) as proc_count from t_ds_process_definition group by release_state;` | |
| numeric value | The number of online tasks | 100ms | `select count(distinct b.post_task_code) from (select user_id, project_id from t_ds_relation_project_user where user_id = 2 group by user_id, project_id) a join (select id, code from t_ds_project) c on a.project_id = c.id join (select project_code, post_task_code from t_ds_process_task_relation) b on c.code = b.project_code;` | |
| numeric value | The number of scheduled tasks, the number of successfully scheduled tasks, and the number of tasks that were successfully scheduled yesterday | 3.3s | `select count(*) from t_ds_process_instance instance, t_ds_process_definition define, t_ds_task_instance tins, t_ds_project project where instance.schedule_time is not null and instance.process_definition_code = define.code and tins.process_instance_id = instance.id and project.code = define.project_code and instance.schedule_time > '2024-07-15 00:00:00' and instance.schedule_time < '2024-07-16 00:00:00';` | It takes about 300ms to query the number of scheduled tasks only, and about 3s to calculate the number of tasks that should be scheduled. |
| manifest | Top 5 Tasks in Running Duration | 200ms | `select name, duration from ( select a.process_definition_code, AVG(timestampdiff(MINUTE, a.start_time, a.end_time)) duration from t_ds_process_instance a, t_ds_process_definition b, t_ds_project c where a.schedule_time is not null and a.process_definition_code = b.code and c.code = b.project_code and a.start_time >= '2024-07-15 00:00:00' and start_time <= '2024-07-16 00:00:00' group by a.process_definition_code order by duration desc, process_definition_code asc limit 5 ) tmp left join t_ds_process_definition c on c.code = tmp.process_definition_code;` | |
| manifest | Top 5 Failed Tasks | 200ms | `select c.name, tmp.count from ( select a.process_definition_code, count(*) count from t_ds_process_instance a, t_ds_process_definition b, t_ds_project c where a.schedule_time is not null and a.process_definition_code = b.code and c.code = b.project_code and a.start_time >= '2024-07-15 00:00:00' and start_time <= '2024-07-16 00:00:00' group by a.process_definition_code order by count desc, process_definition_code asc limit 5 ) tmp left join t_ds_process_definition c on tmp.process_definition_code = c.code;` | |
| trends | Task status trends | 200ms | `select name, hh, sum(sum) as value from ( select n.id, case state when 1 then '正在运行' when 5 then '停止' when 6 then '失败' when 7 then '成功' else '其他' end as name, m.* from ( select j.project_code, j.hh, j.state, sum(j.cnt) as sum from ( select b.project_code, a.state, a.hh, a.cnt from ( select task_code, state, hh, count(*) cnt from ( select task_code, state, hour(DATE_ADD(start_time, INTERVAL 14 HOUR)) hh from t_ds_task_instance where start_time >= '2024-07-15 00:00:00' and start_time <= '2024-07-16 00:00:00' ) k group by k.task_code, k.state, k.hh ) a left join t_ds_task_definition b on a.task_code = b.code ) j where j.project_code is not null group by j.project_code, j.state, j.hh ) m left join t_ds_project n on m.project_code = n.code ) p group by p.hh, p.name order by p.hh;` | A day's worth of statistics, by hourly dimension. |
Search before asking
Motivation
At present, the monitoring items on the homepage of DS scheduling tasks are too simple to provide clear insight into the overall and per-project workflow and task operation status, including statistics on abnormal situations. We plan to add relevant analysis indicators to help administrators, data developers, and frontline operators analyze and adjust execution status. There are two dimensions.

The first is overall scheduling analysis, aimed at cluster administrators. They need to pay attention to the number of projects currently scheduled and the number of online workflows, as well as daily successful schedules, the hourly distribution of scheduled tasks, how many tasks succeed after retry, and, at the task level, which tasks run for a long time or fail often. The purpose of this dimension is to let cluster administrators quickly judge the operating status and task distribution of the scheduling system and provide improvement suggestions to project developers.

The second dimension is project analysis, aimed at the administrators of a particular project. Projects are generally organized with a certain logic, such as layering or running independently by business scenario. Project administrators need to pay attention to the project's workflow status, task status, hourly scheduling distribution, and so on, and, at the task level, which tasks have longer running times and more failures.
Design Detail
The list of planned indicators is shown in the figure below. Numeric indicators will be presented as numeric cards, trend and proportion indicators as line or bar charts, and lists as bar charts.
Compatibility, Deprecation, and Migration Plan
No response
Test Plan
No response
Code of Conduct