Open: EricGao888 opened this issue 2 years ago
Hi,
I think it would be better to include the number of threads related to worker and master execution in the monitoring.
I just updated the Google Doc in the Use Case section, taking some of these metrics into consideration.
Another thing I propose we think about is the granularity of metrics. The current metrics are general statistics, and task and workflow statistics are kept separate. We may need metrics like task.duration.<workflow_id>.<task_id> to monitor vital workflows/tasks more accurately. Of course, a side effect is that we would generate an explosive number of metrics, leading to performance issues. To avoid this, two methods will work:
Besides, we need descriptions of the existing metrics in the official docs. #9441
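To make the cardinality concern concrete, here is a small stdlib-only Python sketch (DolphinScheduler itself is Java and uses Micrometer; this only illustrates the arithmetic). It assumes a hypothetical list of task executions and builds names in the proposed task.duration.<workflow_id>.<task_id> style: every distinct (workflow, task) pair becomes its own series, so the count grows as workflows × tasks. The `vital_workflows` allowlist is a hypothetical knob added for illustration and is not necessarily one of the two methods mentioned above.

```python
from collections import defaultdict
from typing import Iterable, Optional, Set, Tuple

# Hypothetical execution record: (workflow_id, task_id, duration_seconds).
Execution = Tuple[str, str, float]


def metric_key(workflow_id: str, task_id: str,
               vital_workflows: Optional[Set[str]] = None) -> str:
    """Build a per-workflow/per-task name in the proposed
    task.duration.<workflow_id>.<task_id> style.

    If a (hypothetical) allowlist of vital workflows is given, every
    other workflow is folded into one shared name.
    """
    if vital_workflows is not None and workflow_id not in vital_workflows:
        return "task.duration.other"
    return f"task.duration.{workflow_id}.{task_id}"


def aggregate(executions: Iterable[Execution],
              vital_workflows: Optional[Set[str]] = None) -> dict:
    """Sum durations per metric name; each distinct name is one series."""
    totals = defaultdict(float)
    for workflow_id, task_id, seconds in executions:
        totals[metric_key(workflow_id, task_id, vital_workflows)] += seconds
    return dict(totals)


if __name__ == "__main__":
    # 100 workflows x 20 tasks each -> 2000 distinct series.
    runs = [(f"wf{w}", f"t{t}", 1.5) for w in range(100) for t in range(20)]
    print(len(aggregate(runs)))                  # 2000
    print(len(aggregate(runs, {"wf0", "wf1"})))  # 41 (2 x 20 + 1 "other")
```

Moving the IDs from the metric name into tags/labels (as Micrometer and Prometheus usually do) does not change this count: each distinct tag combination is still a separate time series.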
@EricGao888 Hi, I closed #5255 since there is already a module, dolphinscheduler-meter, that can expose the metrics. I will take part in this work to provide some common methods.
I think this issue is worth a DSIP label. WDYT? @zhongjiajie
@devosend Hello, may I ask whether it is possible to include the three PRs of Stage I in beta-2? That way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT?
> I think this issue is worth a DSIP label. WDYT? @zhongjiajie

Agreed, we should add the DSIP label for this.
@EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to turn this into a DSIP?
> @EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to turn this into a DSIP?

Oh, I remember you already discussed the monitoring in an email thread at https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz. Maybe you should append a few words there and tell users we want to convert it to a DSIP now.
> @devosend Hello, may I ask whether it is possible to include the three PRs of Stage I in beta-2? That way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT?

It's a good idea. But beta-2 is mainly for bug fixes and the email has already been sent, so I think we can release it in beta-3.
> Oh, I remember you already discussed the monitoring in an email thread at https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz. Maybe you should append a few words there and tell users we want to convert it to a DSIP now.

@zhongjiajie Sure, I will walk through the guide and add some follow-ups in the previous email thread : )
> It's a good idea. But beta-2 is mainly for bug fixes and the email has already been sent, so I think we can release it in beta-3.

@devosend Makes sense to me. In that case, I'd better finish Stage II before the beta-3 release. Thx for the information~
@SbloodyS Sorry, I mistakenly clicked the unassign button. Could u plz reassign it to me? Thx! 🤣
> @SbloodyS Sorry, I mistakenly clicked the unassign button. Could u plz reassign it to me? Thx! 🤣

Done.
I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would reduce users' setup and learning costs, and users could also customize it based on the template.
> I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would reduce users' setup and learning costs, and users could also customize it based on the template.

I will update the docs so that users can find the metrics-related docs easily.
> I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would reduce users' setup and learning costs, and users could also customize it based on the template.

@SbloodyS I just opened an issue for the comment above: https://github.com/apache/dolphinscheduler/issues/10582
I will submit a PR to add more metrics related to task resources and the alert server sometime this week.
> I will submit a PR to add more metrics related to task resources and the alert server sometime this week.

Great job.
FYI, Prometheus Pushgateway is also supported by Micrometer: https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.prometheus
BTW, the StatsD registry eagerly pushes metrics over UDP to a StatsD agent: https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.statsd
For some metrics generated (built) at runtime, these two approaches may work.
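The StatsD route is easy to picture at the wire level. Below is a stdlib-only Python sketch of the plain (Etsy-style) StatsD line format, `<name>:<value>|<type>`, and a fire-and-forget UDP push. It only illustrates what a StatsD registry sends; it is not Micrometer's implementation. The metric names and the agent address `127.0.0.1:8125` (the conventional StatsD port) are assumptions for illustration.

```python
import socket
from typing import Iterable

# Subset of plain StatsD types: counter, gauge, timer (milliseconds).
SUPPORTED_TYPES = {"c", "g", "ms"}


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric as <name>:<value>|<type>."""
    if metric_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    # Print whole numbers without a trailing ".0" (320.0 -> "320").
    number = float(value)
    text = str(int(number)) if number.is_integer() else repr(number)
    return f"{name}:{text}|{metric_type}"


def push(lines: Iterable[str], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send each line as its own UDP datagram.

    UDP has no delivery guarantee: if no agent is listening, the datagrams
    are typically dropped without an error reaching the caller.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("utf-8"), (host, port))
```

Usage, with hypothetical metric names: `push([statsd_line("ds.task.duration", 320, "ms"), statsd_line("ds.task.count", 1, "c")])`. Because nothing has to be scraped, a metric created at runtime is delivered as soon as it is pushed, at the cost of possible loss on the UDP path.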
Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. What about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~
> Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. What about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

I think it's better to put them into the next version, because we are about to release 3.0.0-release, and during this time we only want to cherry-pick bugfix PRs.
> I think it's better to put them into the next version, because we are about to release 3.0.0-release, and during this time we only want to cherry-pick bugfix PRs.

Sure, makes sense to me. Thx~
Search before asking
Description
Choose good tools, back home early. Use the right scheduler, sleep tight.
We need richer metrics to improve monitoring and give our users a better experience using DolphinScheduler, especially in production environments.
Use case
Description section happen, we could take three steps:
Action Items
Stage I
Stage II
Micrometer besides Prometheus, such as CloudWatch, Datadog, StatsD, Influx, JMX, Elastic, etc. For a full list, visit the Micrometer Setup section. In addition, to provide users with a smooth experience, we should add Docker YAML files for each exporter for demo purposes.
Stage III
Related issues
related: #5255
Are you willing to submit a PR?
Code of Conduct