LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Apache License 2.0

How to find time spent in I/O for a task #33

Closed Niharikadutta closed 2 years ago

Niharikadutta commented 3 years ago

Thanks for this work @LucaCanali ! I had one question about I/O metrics. I know you have mentioned in the limitations section of the README that Spark does not expose I/O and network related metrics. However, I was wondering if there is any way to deduce approximately the time spent in I/O for a job given the current metrics? For instance, what does the difference between ExecutorRunTime and ExecutorCpuTime entail?
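As a rough illustration of the question above, the gap between run time and CPU time can be computed directly from the task metric values. This is a hypothetical sketch (the helper name is mine, not part of sparkMeasure); note that Spark reports `executorRunTime` in milliseconds but `executorCpuTime` in nanoseconds, and that the gap is only an upper bound on I/O wait, since it also contains GC pauses and other off-CPU time:

```python
# Sketch: approximate the time a task spends off-CPU from Spark task metrics.
# The remainder includes I/O wait, but also GC pauses and other blocking,
# so it is NOT a pure I/O measurement.

def off_cpu_time_ms(executor_run_time_ms: float, executor_cpu_time_ns: float) -> float:
    """Upper bound (in ms) on the time a task was not running on CPU."""
    cpu_ms = executor_cpu_time_ns / 1_000_000  # executorCpuTime is in ns
    return max(0.0, executor_run_time_ms - cpu_ms)

# Example with made-up metric values: 1200 ms of run time, 800 ms on CPU.
print(off_cpu_time_ms(1200, 800_000_000))  # -> 400.0
```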

LucaCanali commented 3 years ago

Hi @Niharikadutta, thanks for your interest in sparkMeasure. Measuring I/O is currently not easy to do with Apache Spark, although this is an interesting problem where we can expect and hope for improvements in future versions. The limitations that we see in this area with the current Apache Spark version come ultimately from the Hadoop API that Spark uses to do I/O. I share below a few items that may interest you if you want to investigate this further:

Niharikadutta commented 3 years ago

Great, thank you! I have done some experiments using the approach you laid out in point number 1. What I have seen, though, is that the uninstrumented time is pretty significant for almost all applications, even when the workload has no I/O-related tasks, so I was wondering what could be contributing to that. You gave me hints for that, so I will investigate further. Thanks again!
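The "uninstrumented time" idea mentioned above can be sketched as a simple subtraction over the task metrics: take the run time and remove the components Spark does instrument; what is left is time Spark does not break down. This is a hypothetical approximation (my function and field names mirror Spark TaskMetrics, but the decomposition is not an official Spark formula, and for simplicity all values here are assumed to already be in milliseconds):

```python
# Sketch: "uninstrumented" time = run time minus the instrumented components.
# All metric values are assumed to be in milliseconds for this illustration
# (in real TaskMetrics some fields, e.g. shuffle write time, are in ns).

def uninstrumented_time_ms(metrics: dict) -> float:
    instrumented = (
        metrics["executorCpuTime"]          # time actually on CPU
        + metrics["shuffleFetchWaitTime"]   # blocked waiting for shuffle reads
        + metrics["shuffleWriteTime"]       # time writing shuffle output
    )
    return max(0.0, metrics["executorRunTime"] - instrumented)

# Example with made-up values for one task:
task = {
    "executorRunTime": 1500,
    "executorCpuTime": 900,
    "shuffleFetchWaitTime": 100,
    "shuffleWriteTime": 50,
}
print(uninstrumented_time_ms(task))  # -> 450.0
```

A consistently large remainder, as observed in the comment above, would then point at activities outside these components, such as GC, I/O system calls not covered by shuffle metrics, or scheduling delays.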