banzaicloud / spark-metrics

Spark metrics related custom classes and sinks (e.g. Prometheus)
Apache License 2.0

Repetitions of last metric value #84

Open g1thubhub opened 1 year ago

g1thubhub commented 1 year ago

Hello,

stoader I have a general question: after a Spark application ends, its metrics in the Pushgateway become "stale" because no more updates are pushed. This leads to the following problem:

Problem

The last pushed value is scraped over and over, and it only stops being plotted when the gateway server shuts down, which means the cluster has to terminate. This is apparently by design: users on the mailing list have asked for a way to avoid this behaviour (e.g. https://groups.google.com/g/prometheus-users/c/uGYUQhQAdOE/m/0ICfNNHaAQAJ), but there seems to be no way around it, as the authors explicitly decided against implementing something like a metric "timeout".

Have you also observed this, and do you know a way to solve it? Unfortunately, the pull-based approach does not seem to work with multiple executors per node on a YARN cluster.
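One partial mitigation (my own suggestion, not something discussed in this thread): the standard Pushgateway HTTP API supports DELETE on a metric group, so an application can clear its own group at shutdown and the stale values stop being scraped. A minimal stdlib sketch, where the gateway address and the job/instance labels are assumptions that would have to match what spark-metrics actually pushed:

```python
# Sketch: clearing a Spark application's metric group from Pushgateway at
# shutdown so its last values are no longer scraped. GATEWAY and the label
# values are hypothetical; the DELETE endpoint itself is part of the
# standard Pushgateway HTTP API.
import urllib.request

GATEWAY = "http://pushgateway:9091"  # assumed gateway address

def metric_group_url(job, instance=None):
    """Build the Pushgateway URL identifying a job's metric group."""
    url = f"{GATEWAY}/metrics/job/{job}"
    if instance:
        url += f"/instance/{instance}"
    return url

def delete_metric_group(job, instance=None):
    """Send DELETE for the group; Pushgateway then drops its metrics."""
    req = urllib.request.Request(metric_group_url(job, instance), method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

This only helps if something reliably runs at application end (e.g. a shutdown hook on the driver); a crashed application would still leave stale metrics behind, which is why a TTL-capable gateway is the more robust answer.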

stoader commented 1 year ago

That's due to how Pushgateway works: it keeps the last value for a metric key forever. The only solution I see is a custom-built Pushgateway that is compatible with the upstream one but adds a metrics TTL capability (e.g. https://github.com/dinumathai/pushgateway).

g1thubhub commented 6 months ago

Update: I have implemented a push- and pull-based approach on top of VictoriaMetrics in this project: https://github.com/xonai-computing/xonai-dashboard

It is 100% PromQL-compatible, and the Grafana Prometheus plugin works against it, as does the Prometheus Python client.
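To illustrate the compatibility claim: VictoriaMetrics exposes the same /api/v1/query endpoint as Prometheus, so any PromQL tooling can point at it unchanged. A minimal stdlib sketch, where the server address and the queried metric are assumptions for illustration:

```python
# Sketch: running an instant PromQL query against a VictoriaMetrics instance
# through its Prometheus-compatible HTTP API. VM_URL is a hypothetical
# address; /api/v1/query is the endpoint both systems expose.
import json
import urllib.parse
import urllib.request

VM_URL = "http://victoria-metrics:8428"  # assumed server address

def instant_query_url(promql):
    """Build an /api/v1/query URL for the given PromQL expression."""
    return f"{VM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def instant_query(promql):
    """Execute the query and return the decoded result vector."""
    with urllib.request.urlopen(instant_query_url(promql)) as resp:
        return json.load(resp)["data"]["result"]
```

Because the response shape matches Prometheus's, Grafana's Prometheus data source and the Prometheus Python client can be pointed at the same URL without modification.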