How to find time spent in I/O for a task

Niharikadutta commented 3 years ago

Thanks for this work @LucaCanali ! I had one question about I/O metrics, I know you have mentioned in the limitations sections of the README that Spark does not expose I/O and network related metrics. However I was wondering if there was any way to deduce approximately the time spent in I/O for a job given current metrics? For instance, what does the different between ExecutorRunTime and ExecutorCpuTime entail?

LucaCanali commented 3 years ago

Hi @Niharikadutta thanks for your interest in sparkMeasure. Measuring I/O is currently not easy to do with Apache Spark. Although this is an interesting problem where we can expected and hope improvements in future versions. The limitations that we see in this area with current Apache Spark version come ultimately from the Hadoop API that is used by Spark to do I/O. I share below a few items that may interest you if you want to further investigate this:

One way you can use to estimate the time spent during I/O by Spark tasks using metrics, is to compute the difference between ExecutorRunTime and all other instrumented time-like metrics, which include ExecutorCpuTime, jvmGCTime, etc. That would leave you with the uninstrumented time. This is often (related to) I/O time, but you would have to understand your workload to make sure this assumption holds for the workload you are interested in, as uninstrumented time can have other sources, like Python UDF, CPU throttling with cgroups, etc). A subtle point worth mentioning, is that to do I/O, Spark uses Hadoop client libraries which consume CPU cycles too, so part of the measured ExecutorCpuTime is also time spent into code doing I/O.
Another way is to use Spark Metrics System instrumentation with Spark 3.0 executor plugins and custom Apache Hadoop client with I/O calls instrumentation. I describe some experimental work on this at https://github.com/cerndb/SparkPlugins + https://github.com/cerndb/spark-dashboard and https://databricks.com/session_eu20/what-is-new-with-apache-spark-performance-monitoring-in-spark-3-0

Niharikadutta commented 3 years ago

Great, thank you! I have done some experiments using the approach you laid out in point number 1. What I have seen is though that the uninstrumented time is pretty significant for almost all applications even when the workload has no I/O related tasks so was just wondering what all could be contributing to that, and you gave me hints for that so will investigate further around that. Thanks again!

LucaCanali / sparkMeasure

How to find time spent in I/O for a task #33