apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0
[DSIP-8][Metrics] Improve DolphinScheduler Monitoring #9324

Open EricGao888 opened 2 years ago

EricGao888 commented 2 years ago

Description

Use case

  1. List all the metrics we need, classified by the different parts of DolphinScheduler, such as master, worker, API server, etc. Here's the doc link for the metrics list.
  2. Apply the code in the right places and collect these metrics with our metrics-collection framework.
  3. Find a method to expose these metrics to external systems. related: #5255
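
As a rough illustration of step 3, one common way to expose metrics to external systems is a pull endpoint that serves the Prometheus text exposition format. The sketch below is in Python for brevity (DolphinScheduler itself is Java); the metric names, labels, and the `render_prometheus` helper are hypothetical, not the project's actual metrics.

```python
def render_prometheus(samples):
    """Render (name, labels, value) samples as Prometheus text exposition lines.

    Simplified sketch: label values are not escaped and no HELP/TYPE lines are emitted.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs, sorted for stable output.
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric names, one per server component.
samples = [
    ("ds_master_task_dispatch_total", {"server": "master"}, 42),
    ("ds_worker_task_running", {"server": "worker"}, 3),
    ("ds_api_request_total", {}, 128),
]
text = render_prometheus(samples)
print(text, end="")
```

A scraper would then fetch this text periodically, so the servers themselves never need to know where the metrics end up.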

Action Items

Stage I

Stage II

Stage III

Related issues

related: #5255

SbloodyS commented 2 years ago

I think it would be better to include the number of threads related to execution on the worker and master in the monitoring.

EricGao888 commented 2 years ago

I just updated the google doc in the Use Case section, taking some metrics into consideration.

Another thing I propose we think about is the granularity of metrics. The current metrics are general statistics, and the statistics for tasks and workflows are kept separate. We may need metrics like task.duration.<workflow_id>.<task_id> to monitor vital workflows/tasks more accurately. Of course, a side effect is that we will generate an explosive number of metrics, which could lead to performance issues. To mitigate this, two methods could work:

  1. There will be a config option for users to switch the generation of these metrics on or off.
  2. DolphinScheduler will send those metrics over UDP.
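
To make the granularity trade-off and the on/off switch concrete, here is a minimal sketch in Python (DolphinScheduler itself is Java). The `TaskMetrics` class and its `per_task_enabled` flag are hypothetical names used only for illustration; the point is how quickly the number of metric series grows once the per-workflow/per-task metrics are switched on.

```python
from collections import defaultdict


class TaskMetrics:
    """Hypothetical in-memory recorder with a switch for fine-grained metrics."""

    def __init__(self, per_task_enabled=False):
        self.per_task_enabled = per_task_enabled
        self.durations = defaultdict(list)  # metric name -> recorded durations (ms)

    def record_duration(self, workflow_id, task_id, millis):
        # The general statistic is always recorded.
        self.durations["task.duration"].append(millis)
        # The fine-grained series is only created when the switch is on.
        if self.per_task_enabled:
            name = f"task.duration.{workflow_id}.{task_id}"
            self.durations[name].append(millis)


off = TaskMetrics(per_task_enabled=False)
on = TaskMetrics(per_task_enabled=True)
for metrics in (off, on):
    metrics.record_duration(1, 10, 120)
    metrics.record_duration(1, 11, 80)
    metrics.record_duration(2, 10, 300)

print(len(off.durations), len(on.durations))
```

With the switch off there is a single series no matter how many tasks run; with it on, every distinct workflow/task pair adds one more series.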

EricGao888 commented 2 years ago

Besides, we need some descriptions of the existing metrics in the official docs. #9441

ruanwenjun commented 2 years ago

@EricGao888 Hi, I closed #5255, since there is already a module, dolphinscheduler-meter, that can expose the metrics. I will take part in this work to provide some common methods.

SbloodyS commented 2 years ago

I think this issue is worth a DSIP label. WDYT? @zhongjiajie

EricGao888 commented 2 years ago

@devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT

zhongjiajie commented 2 years ago

> I think this issue is worth a DSIP label. WDYT? @zhongjiajie

Agreed, we should add the DSIP label for this.

zhongjiajie commented 2 years ago

@EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to turn it into a DSIP?

zhongjiajie commented 2 years ago

> @EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to turn it into a DSIP?

Oh, I remember you already discussed the monitoring by e-mail in https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz. Maybe you should append a few words there and tell users we want to convert it to a DSIP now.

devosend commented 2 years ago

> @devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT

It's a good idea. But beta-2 is mainly for bug fixes, and the email has already been sent. So I think we can release it in beta-3.

EricGao888 commented 2 years ago

> @EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to turn it into a DSIP?
>
> Oh, I remember you already discussed the monitoring by e-mail in https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz. Maybe you should append a few words there and tell users we want to convert it to a DSIP now.

@zhongjiajie Sure, I will walk through the guide and add some follow-ups in the previous email thread : )

EricGao888 commented 2 years ago

> > @devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT
>
> It's a good idea. But beta-2 is mainly for bug fixes, and the email has already been sent. So I think we can release it in beta-3.

@devosend Makes sense to me. In that case, I'd better finish Stage II before the beta-3 release. Thx for the information~

EricGao888 commented 2 years ago

@SbloodyS Sorry, I mistakenly clicked the unassign button. Could u plz reassign it to me? Thx! 🤣

SbloodyS commented 2 years ago

> @SbloodyS Sorry, I mistakenly clicked the unassign button. Could u plz reassign it to me? Thx! 🤣

Done.

SbloodyS commented 2 years ago

I think we could publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would lower the setup and learning cost for users, and they could also customize their own dashboards based on the template.

EricGao888 commented 2 years ago

> I think we could publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would lower the setup and learning cost for users, and they could also customize their own dashboards based on the template.

I will update the docs so that users could find metrics-related docs easily.

EricGao888 commented 2 years ago

> I think we could publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would lower the setup and learning cost for users, and they could also customize their own dashboards based on the template.

@SbloodyS I just opened an issue for the comment above. https://github.com/apache/dolphinscheduler/issues/10582

EricGao888 commented 2 years ago

I will submit a PR to add some more metrics related to task resource and alert server sometime this week.

lgcareer commented 2 years ago

> I will submit a PR to add some more metrics related to task resource and alert server sometime this week.

Great Job.

EricGao888 commented 2 years ago

FYI, Prometheus Pushgateway is also supported by Micrometer: https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.prometheus

BTW, the StatsD registry eagerly pushes metrics over UDP to a StatsD agent: https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.statsd

For some metrics generated (built) during runtime, these two approaches may work.
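
To show what "pushing over UDP" amounts to, here is a minimal, stdlib-only Python sketch of the StatsD timing line format (`<name>:<value>|ms`) sent as a single datagram. It is not Micrometer's implementation; the `statsd_timing` and `push` helpers are hypothetical, and a local socket stands in for the StatsD agent so the example is self-contained.

```python
import socket


def statsd_timing(name, millis):
    """Encode one timing sample in the StatsD line format: <name>:<value>|ms."""
    return f"{name}:{millis}|ms".encode("ascii")


def push(sock, addr, payload):
    # Fire-and-forget: UDP gives no delivery guarantee and no back-pressure,
    # so a slow or absent agent does not block the sender.
    sock.sendto(payload, addr)


# A local socket stands in for the StatsD agent (assumption for this sketch).
agent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
agent.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
agent.settimeout(2.0)

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
push(client, agent.getsockname(), statsd_timing("task.duration.1.10", 120))

received, _ = agent.recvfrom(1024)
print(received)

client.close()
agent.close()
```

Because the sender never waits for a reply, metrics built at runtime can be emitted without registering them up front, at the cost of possibly losing samples.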

EricGao888 commented 2 years ago

Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. What about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

caishunfeng commented 2 years ago

> Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. What about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

I think it's better to put them into the next version. We are about to release 3.0.0-release, and during this time we only want to cherry-pick bug-fix PRs.

EricGao888 commented 2 years ago

> > Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. What about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~
>
> I think it's better to put them into the next version. We are about to release 3.0.0-release, and during this time we only want to cherry-pick bug-fix PRs.

Sure, makes sense to me. Thx~