LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Apache License 2.0

Guide for metrics interpretation #32

Closed mansenfranzen closed 2 years ago

mansenfranzen commented 4 years ago

Hi there,

thanks for all the resources you've provided for getting a better understanding of Spark metrics. In particular, I found your blog post very useful.

When using sparkMeasure, I still found it hard to come up with a proper interpretation for all the resulting task metrics. There are some easy ones like CPU utilization (as described in your blog post). However, many others are not so obvious. For example:

What I'm basically aiming at is a beginner's tutorial that guides new users through all the metrics, with examples of what they mean and how they might be relevant for performance (e.g. spilling = bad).
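As a concrete starting point for such a guide, here is a minimal sketch of collecting the metrics under discussion with sparkMeasure from a Scala spark-shell started with the sparkMeasure package on the classpath; the SQL query is just a placeholder workload:

```scala
// Collect stage-level task metrics around a placeholder query;
// runAndMeasure also prints the aggregated report (elapsed time,
// executorRunTime, executorCpuTime, shuffle and spill metrics, etc.).
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure(
  spark.sql("select count(*) from range(1000) cross join range(1000)").show()
)
```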

For now, it would be great if you could help me better understand the 3 questions raised above. Longer term, it would be awesome to start a small guide on how to interpret Spark metrics for the entire Spark community (I would be in for that). Perhaps there is already an existing one, but I couldn't find anything appropriate, neither in the official docs nor in any personal projects.
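As an illustration of the kind of interpretation step such a guide could spell out, here is a hedged sketch that derives two simple indicators from the stage-metrics DataFrame sparkMeasure exposes: a CPU-utilization ratio (executorCpuTime over executorRunTime, as discussed in the blog post) and total bytes spilled (non-zero spill generally signals memory pressure). Column names follow sparkMeasure's stage metrics; the two time columns are assumed here to be reported in the same unit.

```scala
// Hedged sketch: aggregate the collected stage metrics and derive a
// CPU-utilization ratio and spill totals. Uses the stageMetrics instance
// from the collection sketch above; column names are those of
// sparkMeasure's stage-metrics DataFrame.
import org.apache.spark.sql.functions._

val metricsDF = stageMetrics.createStageMetricsDF("PerfStageMetrics")
metricsDF
  .agg(
    sum("executorCpuTime").as("cpuTime"),
    sum("executorRunTime").as("runTime"),
    sum("memoryBytesSpilled").as("memorySpilled"),
    sum("diskBytesSpilled").as("diskSpilled")
  )
  .withColumn("cpuUtilization", col("cpuTime") / col("runTime"))
  .show()
```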

LucaCanali commented 4 years ago

Hi @mansenfranzen, I have read your ideas and questions with interest. Unfortunately, Spark metrics are not yet well documented (beyond the short descriptions in the monitoring documentation). Further investigation of the source code, plus reference workloads that clarify what a particular metric actually measures, would be quite good to have.

These days, for Spark performance troubleshooting, I am trying to focus more on metrics that support a time-based drilldown, for example to understand how much of the executor time was spent on CPU, how much on I/O, how much on GC, how much on shuffle-related operations, etc. Workload-type metrics are also useful, of course, notably memory- and I/O-related ones. I find it more convenient to use the Spark metrics system for that (which is in any case linked with the metrics used here in sparkMeasure); see https://github.com/cerndb/spark-dashboard

I'll be interested to know if you make further progress on a deeper understanding of the metrics. Best, L.
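For readers following the spark-dashboard link, here is a minimal sketch of pointing the Spark metrics system at a Graphite-compatible endpoint, which is the kind of sink that setup builds on; the host, port, and prefix values below are placeholders, not part of the original discussion.

```scala
// Hedged sketch: configure the Spark metrics system to push metrics to a
// Graphite-compatible sink via SparkConf entries (spark.metrics.conf.*).
// Host, port, period, and prefix are placeholders to adapt to the actual setup.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metrics-sink-example")
  .config("spark.metrics.conf.*.sink.graphite.class",
          "org.apache.spark.metrics.sink.GraphiteSink")
  .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com")
  .config("spark.metrics.conf.*.sink.graphite.port", "2003")
  .config("spark.metrics.conf.*.sink.graphite.period", "10")
  .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
  .config("spark.metrics.conf.*.sink.graphite.prefix", "spark")
  .getOrCreate()
```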