LucaCanali / Miscellaneous

Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark, tools for performance testing CPUs, Jupyter notebooks examples for Spark, examples for Oracle and other DB systems.
Apache License 2.0

Scaling questions #6

Closed matschaffer-roblox closed 1 year ago

matschaffer-roblox commented 1 year ago

Hey Luca!

Thanks again for your spark dashboarding work. It gave me a great leg up on implementing our own metrics solution.

One thing I'm noticing, though, is that the Spark metrics, being per app ID, have really high cardinality, and our metrics receivers (Prometheus and VictoriaMetrics) seem to be struggling as the number of series grows (we're seeing up to 30MM series per cluster in some cases).

Have you seen anything like this on your installation? Does InfluxDB maybe just handle it better?

LucaCanali commented 1 year ago

At present we use the dashboard as an opt-in option, which users normally activate only when troubleshooting. It's tempting to log everything, just in case, but it can indeed be quite a load on the receiving end, as there are many metrics and by default they are logged every 10 seconds. On InfluxDB we also set retention to automatically drop metrics logged by "old Spark applications".

matschaffer-roblox commented 1 year ago

Fantastic. Thanks for the context!

matschaffer-roblox commented 1 year ago

oh, also what mechanism do you use for opt-in?

I'm thinking of basing it off the spark.metrics.conf.*.sink.graphite.prefix setting. That way a cluster can have a low-ingest default, we could override that default to get all metrics for jobs that use a given cluster, or override for a given app to get all metrics just for that app.
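
A minimal sketch of the override order described above, assuming the receiver keeps every series under one prefix and only a reduced set under another. The prefix values `spark_low` and `spark_full` and the helper `effective_prefix` are hypothetical names made up for illustration; only the configuration key comes from the thread.

```python
# Configuration key named in the comment above.
PREFIX_KEY = "spark.metrics.conf.*.sink.graphite.prefix"

# Hypothetical prefix values (assumption): the metrics receiver is assumed
# to keep all series under FULL_PREFIX and only a few under LOW_PREFIX.
LOW_PREFIX = "spark_low"
FULL_PREFIX = "spark_full"


def effective_prefix(cluster_conf: dict, app_conf: dict) -> str:
    """Resolve the prefix: app setting wins, then cluster, then the low-ingest default."""
    return app_conf.get(PREFIX_KEY) or cluster_conf.get(PREFIX_KEY) or LOW_PREFIX


# Cluster left at the default, one app opts in to full metrics.
print(effective_prefix({}, {PREFIX_KEY: FULL_PREFIX}))
```

With this ordering, an override at the cluster level applies to every job on that cluster unless the job sets its own prefix.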

LucaCanali commented 1 year ago

In our case the spark.metrics.conf.* parameters are added to Spark jobs only if users opt in. In one notable case we have a web notebook service where users can set the configuration themselves from a GUI.
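
As a rough illustration of this kind of opt-in (a sketch under stated assumptions, not the actual setup described here), the Graphite sink parameters can be generated only when a user asks for them. The helper names `graphite_sink_conf` and `to_submit_args` and the host `graphite.example.org` are hypothetical; the `spark.metrics.conf.*.sink.graphite.*` keys and the `GraphiteSink` class are standard Spark metrics settings.

```python
from typing import Dict, List, Optional


def graphite_sink_conf(
    opt_in: bool,
    host: str,
    port: int = 2003,
    period_seconds: int = 10,
    prefix: Optional[str] = None,
) -> Dict[str, str]:
    """Return Graphite sink settings, or an empty dict if the user did not opt in."""
    if not opt_in:
        return {}  # no sink configured, so the job sends no metrics to it
    base = "spark.metrics.conf.*.sink.graphite."
    conf = {
        base + "class": "org.apache.spark.metrics.sink.GraphiteSink",
        base + "host": host,
        base + "port": str(port),
        base + "period": str(period_seconds),
        base + "unit": "seconds",
    }
    if prefix:
        conf[base + "prefix"] = prefix
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a config dict into --conf key=value arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical host name, for illustration only.
opted_in = graphite_sink_conf(True, "graphite.example.org", period_seconds=30)
print(to_submit_args(opted_in))
```

Raising the period from the 10 seconds mentioned earlier in the thread is one way to reduce the load on the receiving end for jobs that do opt in.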

matschaffer-roblox commented 1 year ago

Got it. Thanks!