NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Expose GPU hardware attributes in spark event log #997

Open btong04 opened 4 years ago

btong04 commented 4 years ago

Is your feature request related to a problem? Please describe. We're interested in data mining Spark event logs for benchmarking purposes. However, the GPU name is not exposed, so it's difficult to determine what hardware is present. Our parser reads the top 4 rows and the last row of the event log file to obtain the relevant fields related to configuration and total run time.

Describe the solution you'd like It would be great if something like "spark.gpu.*" could be added (with attributes like name, memory, cores, cudaVersion, etc.) under "Event: SparkListenerEnvironmentUpdate" >> "Spark Properties":

```json
"Spark Properties": {
  "spark.gpu.driver.name": "Tesla T4",
  "spark.gpu.driver.memory": "16g",
  "spark.gpu.driver.driverVersion": "450.11",
  "spark.gpu.driver.cudaVersion": "11.0",
  "spark.gpu.driver.cores": "2560",
  "spark.gpu.executor.name": "Tesla V100",
  "spark.gpu.executor.memory": "16g",
  "spark.gpu.executor.driverVersion": "450.11",
  "spark.gpu.executor.cudaVersion": "11.0",
  "spark.gpu.executor.cores": "5120"
}
```
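For reference, a Spark event log is newline-delimited JSON (one event object per line), so if properties like these existed, a log-mining script could pick them up with something like the sketch below. The `spark.gpu.*` keys are the hypothetical ones proposed above, not real Spark properties:

```python
import json

def gpu_properties(event_log_lines):
    """Scan Spark event-log lines (one JSON object per line) and return
    any hypothetical spark.gpu.* entries from the environment update event."""
    for line in event_log_lines:
        event = json.loads(line)
        if event.get("Event") == "SparkListenerEnvironmentUpdate":
            props = event.get("Spark Properties", {})
            return {k: v for k, v in props.items() if k.startswith("spark.gpu.")}
    return {}

# Minimal fabricated event line for illustration:
sample = ['{"Event": "SparkListenerEnvironmentUpdate", '
          '"Spark Properties": {"spark.gpu.executor.name": "Tesla V100", '
          '"spark.app.name": "bench"}}']
print(gpu_properties(sample))  # {'spark.gpu.executor.name': 'Tesla V100'}
```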

The driver GPU may be different from the executors', or the driver may have no GPU at all.

Describe alternatives you've considered One workaround would be to pass strings through the "App Name" field and parse them in post-processing, but that would require code modifications, so it wouldn't be applicable to generic workloads.
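For completeness, a rough sketch of that workaround's post-processing side, assuming a made-up `|gpu=` naming convention (a job would set an app name like `etl-bench|gpu=Tesla T4`, and the parser would split it back out):

```python
def parse_app_name(app_name):
    """Split an app name of the form '<name>|gpu=<gpu name>' into its parts.
    The '|gpu=' convention is invented for this example, not a Spark one."""
    name, sep, gpu = app_name.partition("|gpu=")
    return {"app": name, "gpu": gpu if sep else None}

print(parse_app_name("etl-bench|gpu=Tesla T4"))
# {'app': 'etl-bench', 'gpu': 'Tesla T4'}
print(parse_app_name("etl-bench"))
# {'app': 'etl-bench', 'gpu': None}
```

This illustrates why the approach doesn't generalize: every job has to be modified to embed the GPU name in its app name before submission.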

tgravescs commented 4 years ago

Spark properties are for configuring Spark; they're not really meant for state information about the nodes you run on. There is the runtime information section, but that is easier because it should be global. The data you are asking for here could differ per executor, and executors can come and go with dynamic allocation and failures. Really, this is more of a generic Spark feature request, because you could want the same for CPU type, disk type, network type, etc. Spark has some monitoring facilities via metrics, but none that cover this.

This could also potentially be done in the executor plugins if you have somewhere to report it to, but that wouldn't be the event log.

Unfortunately, there's no real easy way to do this right now; ideally it would probably be part of the executor registration information.