LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Apache License 2.0

Better to have the IO metrics for non-hdfs type such as S3 Storage #31

Closed jack1981 closed 2 years ago

jack1981 commented 4 years ago

We are using S3-compatible object storage for Spark storage, but the current default filesystem I/O metrics cover HDFS only.

SELECT non_negative_derivative("value", 1s) FROM "filesystem.hdfs.read_bytes" WHERE "applicationid" = '$ApplicationId' AND $timeFilter GROUP BY process

Is there any way to fetch I/O metrics for other distributed file systems?

Thanks !
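For illustration, the query above could be adapted for an S3A series along these lines; the measurement name `filesystem.s3a.read_bytes` is an assumption for the sake of the example, not a series confirmed to exist in the thread:

```sql
-- Hypothetical: assumes the dashboard exposes an s3a measurement
-- analogous to the hdfs one
SELECT non_negative_derivative("value", 1s)
FROM "filesystem.s3a.read_bytes"
WHERE "applicationid" = '$ApplicationId' AND $timeFilter
GROUP BY process
```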

LucaCanali commented 4 years ago

Hi @jack1981, I believe your question fits better in the context of the spark-dashboard implementation with the Spark metrics system, as described in https://github.com/LucaCanali/Miscellaneous/tree/master/Spark_Dashboard and in https://github.com/cerndb/spark-dashboard

In that context, I'd like to share that I have been working on extensions of Spark monitoring to cover S3A and other I/O and OS metrics for Spark 3.0; please see https://github.com/cerndb/SparkPlugins I'll be interested in collecting feedback. Best, L.
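As a rough sketch of the direction such extensions can take, the Spark 3.0 plugin API lets an executor plugin register custom gauges with the Spark metrics system, here reading S3A counters through Hadoop's `GlobalStorageStatistics`. The plugin class and the statistic key `"bytesRead"` are illustrative assumptions, not the actual SparkPlugins code:

```scala
import java.util.{Map => JMap}

import com.codahale.metrics.Gauge
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Hypothetical sketch: expose an s3a byte counter as a Spark metrics gauge.
class S3AMetricsPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      ctx.metricRegistry.register("s3a.bytesRead", new Gauge[Long] {
        override def getValue: Long = {
          // GlobalStorageStatistics aggregates per-scheme FS statistics
          // (Hadoop 2.8+); the exact statistic name may differ by version.
          val stats = FileSystem.getGlobalStorageStatistics.get("s3a")
          Option(stats).map(_.getLong("bytesRead").longValue).getOrElse(0L)
        }
      })
    }
  }
}
```

Such a plugin would be enabled with `--conf spark.plugins=S3AMetricsPlugin`, after which the gauge flows through the regular Spark metrics sinks and can be charted like the hdfs series.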