Closed matschaffer-roblox closed 1 year ago
At present we use the dashboard as an opt-in option, which users normally activate only when troubleshooting. It's tempting to log everything just in case, but it can put quite a load on the receiving end: the metrics are numerous, and by default they are logged every 10 seconds. On InfluxDB we also set a retention policy to automatically drop metrics logged by "old Spark applications".
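For context, the 10-second default mentioned above is the sink's polling period. A minimal Graphite-sink configuration might look like this (the host, port, and prefix values below are placeholders, not anyone's actual setup):

```properties
# metrics.properties (or equivalent spark.metrics.conf.* properties)
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com   # placeholder host
*.sink.graphite.port=2003
*.sink.graphite.period=10                   # polling period; 10 seconds is the default
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark-metrics        # placeholder prefix
```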
Fantastic. Thanks for the context!
oh, also what mechanism do you use for opt-in?
I'm thinking of basing it off the spark.metrics.conf.*.sink.graphite.prefix setting. That way a cluster can have a low-ingest default, we could override that default to get all metrics for jobs that use a given cluster, or override it for a given app to get all metrics just for that app.
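As a sketch of that layering (the prefix values and file locations here are illustrative, not a tested setup):

```properties
# Cluster-wide default in spark-defaults.conf: low-ingest prefix
spark.metrics.conf.*.sink.graphite.prefix=lowingest

# Per-app override at submit time: full metrics just for this application
# spark-submit --conf "spark.metrics.conf.*.sink.graphite.prefix=allmetrics" ...
```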
In our case the spark.metrics.conf.* parameters are only added to Spark jobs if users opt in to them. In one notable case, a web notebooks service, users can configure this themselves in self-service from a GUI.
Got it. Thanks!
Hey Luca!
Thanks again for your spark dashboarding work. It gave me a great leg up on implementing our own metrics solution.
One thing I'm noticing, though, is that the Spark metrics, being keyed per app-id, have really high cardinality, and our metrics receiver (Prometheus & VictoriaMetrics) seems to be struggling as the number of series grows (seeing up to 30MM series per cluster in some cases).
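A rough back-of-the-envelope sketch of why per-app-id labels inflate series counts: every distinct (app_id, executor, metric) combination is its own time series, so retained historical app-ids act as a multiplier. All the numbers below are hypothetical, not measured from this cluster:

```python
def estimated_series(apps_retained: int, executors_per_app: int,
                     metrics_per_executor: int) -> int:
    """Rough upper bound on distinct series when metric names/labels
    include the app-id: each new app_id spawns a fresh set of series."""
    return apps_retained * executors_per_app * metrics_per_executor

# Hypothetical: 10k historical app-ids still in retention, 100 executors
# per app, ~30 metrics per executor -> 30 million series.
print(estimated_series(10_000, 100, 30))  # 30000000
```

This is why dropping the app-id label (or expiring old app-ids aggressively, as with the InfluxDB retention policy mentioned above) is usually the lever that matters most.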
Have you seen anything like this on your installation? Does influx maybe just handle it better?