LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Apache License 2.0

Guide for metrics interpretation #32

Closed mansenfranzen closed 2 years ago

mansenfranzen commented 4 years ago

Hi there,

thanks for all the resources you've provided for getting a better understanding of Spark metrics. In particular, I found your blog post very useful.

When using sparkMeasure, I still found it hard to come up with a proper interpretation for all the resulting task metrics. There are some easy ones like CPU utilization (as described in your blog post). However, many others are not so obvious. For example:

What I'm basically aiming at is a beginner's tutorial that guides new users through all the metrics, with examples of what they mean and how they might be relevant for performance (e.g. spilling = bad).
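As a concrete starting point for such a guide, here is a minimal sketch of collecting the metrics under discussion with sparkMeasure from a Scala spark-shell started with the sparkMeasure package on the classpath; the SQL query is just a placeholder workload:

```scala
// Collect stage-level task metrics around a placeholder query;
// runAndMeasure also prints the aggregated report (elapsed time,
// executorRunTime, executorCpuTime, shuffle and spill metrics, etc.).
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure(
  spark.sql("select count(*) from range(1000) cross join range(1000)").show()
)
```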

For now, it would be great if you could help me better understand the 3 questions raised above. Longer term, it would be awesome to start a small guide on how to interpret Spark metrics for the entire Spark community (I would be in for that). Perhaps there is already an existing one, but I couldn't find anything appropriate, neither in the official docs nor in any personal projects.
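As an illustration of the kind of interpretation step such a guide could spell out, here is a hedged sketch that derives two simple indicators from the stage-metrics DataFrame sparkMeasure exposes: a CPU-utilization ratio (executorCpuTime over executorRunTime, as discussed in the blog post) and total bytes spilled (non-zero spill generally signals memory pressure). Column names follow sparkMeasure's stage metrics; the two time columns are assumed here to be reported in the same unit.

```scala
// Hedged sketch: aggregate the collected stage metrics and derive a
// CPU-utilization ratio and spill totals. Uses the stageMetrics instance
// from the collection sketch above; column names are those of
// sparkMeasure's stage-metrics DataFrame.
import org.apache.spark.sql.functions._

val metricsDF = stageMetrics.createStageMetricsDF("PerfStageMetrics")
metricsDF
  .agg(
    sum("executorCpuTime").as("cpuTime"),
    sum("executorRunTime").as("runTime"),
    sum("memoryBytesSpilled").as("memorySpilled"),
    sum("diskBytesSpilled").as("diskSpilled")
  )
  .withColumn("cpuUtilization", col("cpuTime") / col("runTime"))
  .show()
```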

LucaCanali commented 4 years ago

Hi @mansenfranzen, I have read your ideas and questions with interest. Unfortunately, Spark metrics are not yet well documented (beyond the short descriptions in the monitoring documentation). Further investigation of the source code, plus reference workloads that clarify what a particular metric actually measures, would be quite good to have.

These days, for Spark performance troubleshooting, I am trying to focus more on metrics that support a time-based drilldown, for example to understand how much of the executor time was spent on CPU, how much on I/O, how much on GC, how much on shuffle-related operations, etc. Workload-type metrics are also useful, of course, notably memory- and I/O-related ones. I find it more convenient to use the Spark metrics system for that (which is in any case linked with the metrics used here in sparkMeasure); see https://github.com/cerndb/spark-dashboard

I'll be interested to know if you make further progress on a deeper understanding of the metrics. Best, L.
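For readers following the spark-dashboard link, here is a minimal sketch of pointing the Spark metrics system at a Graphite-compatible endpoint, which is the kind of sink that setup builds on; the host, port, and prefix values below are placeholders, not part of the original discussion.

```scala
// Hedged sketch: configure the Spark metrics system to push metrics to a
// Graphite-compatible sink via SparkConf entries (spark.metrics.conf.*).
// Host, port, period, and prefix are placeholders to adapt to the actual setup.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metrics-sink-example")
  .config("spark.metrics.conf.*.sink.graphite.class",
          "org.apache.spark.metrics.sink.GraphiteSink")
  .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com")
  .config("spark.metrics.conf.*.sink.graphite.port", "2003")
  .config("spark.metrics.conf.*.sink.graphite.period", "10")
  .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
  .config("spark.metrics.conf.*.sink.graphite.prefix", "spark")
  .getOrCreate()
```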