NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS

[FEA] port legacy diagnostic tool capabilities #419

Open zhanga5 opened 1 year ago

zhanga5 commented 1 year ago

Is your feature request related to a problem? Please describe.
The legacy diagnostic tool was removed by https://github.com/NVIDIA/spark-rapids-tools/pull/406, so the following diagnostic functions are no longer supported (a sketch of the first two is shown after the list):

'nv_driver': dump NVIDIA driver info via the `nvidia-smi` command
'cuda_version': check that the CUDA toolkit major version is >= 11.0
'rapids_jar': check that only a single RAPIDS Accelerator for Apache Spark jar is installed and verify its signature
'deprecated_jar': check whether a deprecated (cudf) jar is installed; i.e., there should be no cudf jar starting with RAPIDS Accelerator for Apache Spark 22.08
'spark': run a hello-world Spark application on CPU and GPU
'perf': performance test comparing a Spark job between CPU and GPU
'spark_job': run a hello-world Spark application on CPU and GPU via the Dataproc job interface
'perf_job': performance test comparing a Spark job between CPU and GPU via the Dataproc job interface
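For context, here is a minimal sketch of what the first two checks could look like in Python; the function names, the use of `nvcc --version` for version detection, and the parsing regex are illustrative assumptions, not the legacy tool's actual code:

```python
import re
import subprocess

def check_nv_driver() -> str:
    """Dump NVIDIA driver info by shelling out to `nvidia-smi` (assumes it is on PATH)."""
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True)
    return result.stdout

def check_cuda_version(min_major: int = 11) -> bool:
    """Check that the CUDA toolkit major version reported by `nvcc --version` meets a minimum."""
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True, check=True)
    match = re.search(r"release (\d+)\.(\d+)", result.stdout)
    if not match:
        raise RuntimeError("Could not parse the CUDA version from nvcc output")
    return int(match.group(1)) >= min_major
```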

Describe the solution you'd like
Port the legacy diagnostic functions to the new pytools framework.

Describe alternatives you've considered
Restore the legacy diagnostic tool that was removed by https://github.com/NVIDIA/spark-rapids-tools/pull/406.

Additional context
None.

amahussein commented 1 year ago

@zhanga5 , I thought #375 covered some of those items in the list. What am I missing?

  • GPU hardware info (lshw, lspci)
  • GPU Driver version
  • Plugin version
  • CUDA Runtime version, if it exists
  • Spark version and configuration, if they exist

For the remaining ones:

  • perf: I don't see how this will be part of the diagnostics. If it is not a standard benchmark, then it is unlikely to be accepted as a performance evaluation.
  • spark_job: Is there a scenario in which a CSP cluster won't run a Spark hello-world? I mean, what are the cases that would cause this to fail?
  • perf_job: same issue as perf
  • deprecated_jar: do we still need to check against that? It has been deprecated for a long time now. Do we still have users deploying cudf.jar?

zhanga5 commented 1 year ago

@zhanga5 , I thought #375 covered some of those items in the list. What am I missing?

  • GPU hardware info (lshw, lspci)
  • GPU Driver version
  • Plugin version
  • CUDA Runtime version, if it exists
  • Spark version and configuration, if they exist
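As a reference for the items above, here is a minimal sketch of collecting that information by shelling out to standard commands; the helper name and the exact command selection are assumptions for illustration, not the actual implementation of #375:

```python
import shutil
import subprocess

def collect_env_info() -> dict:
    """Run standard system commands, when available, and collect their output for a diagnostic report."""
    commands = {
        "gpu_hardware": ["lshw", "-C", "display"],
        "pci_devices": ["lspci"],
        "gpu_driver": ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        "cuda_runtime": ["nvcc", "--version"],
        "spark_version": ["spark-submit", "--version"],
    }
    info = {}
    for name, cmd in commands.items():
        if shutil.which(cmd[0]) is None:
            info[name] = "not available"  # e.g. no CUDA toolkit or Spark on this node
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # spark-submit prints its version banner to stderr, so fall back to it.
        info[name] = (proc.stdout or proc.stderr).strip()
    return info
```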

For the remaining ones:

  • perf: I don't see how this will be part of the diagnostics. If it is not a standard benchmark, then it is unlikely to be accepted as a performance evaluation.
  • spark_job: Is there a scenario in which a CSP cluster won't run a Spark hello-world? I mean, what are the cases that would cause this to fail?
  • perf_job: same issue as perf
  • deprecated_jar: do we still need to check against that? It has been deprecated for a long time now. Do we still have users deploying cudf.jar?

Yeah, perhaps they're not all required at this moment. @viadea / @nvliyuan, could you guys help review these functions and update the previous requirements if possible?

amahussein commented 1 year ago

CCing: @mattahrens and @mattf

nvliyuan commented 1 year ago

Yeah, perhaps they're not all required at this moment. @viadea / @nvliyuan, could you guys help review these functions and update the previous requirements if possible?

Maybe @mattf could help answer the requirement questions?

viadea commented 1 year ago

For the remaining ones:

* `perf`: I don't see how this will be part of the diagnostics. If it is not a standard benchmark, then it is unlikely to be accepted as a performance evaluation.

* `spark_job`: Is there a scenario in which a CSP cluster won't run a Spark hello-world? I mean, what are the cases that would cause this to fail?

* `perf_job`: same issue as `perf`

* `deprecated_jar`: do we still need to check against that? It has been deprecated for a long time now. Do we still have users deploying `cudf.jar`?

For perf or perf_job, I think it is fine to remove them.

For spark_job, the intention is to make sure a sample DataFrame API job can run without any issue. The reasons for failure can vary, such as no GPU resource being discovered, a misconfiguration in spark-defaults.conf, etc. I just want to make sure the cluster is healthy before the customer runs any other production jobs on it.
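To make the intent concrete, here is a minimal PySpark hello-world of the kind such a check could submit; since the job itself is trivial, any failure points at cluster setup (no GPU discovered, a bad spark-defaults.conf, a missing plugin jar) rather than at the job. This script is an assumed example, not the legacy tool's actual job:

```python
from pyspark.sql import SparkSession

# On a GPU cluster the RAPIDS plugin is typically enabled via spark-defaults.conf
# (spark.plugins=com.nvidia.spark.SQLPlugin), so no plugin config is needed here.
spark = SparkSession.builder.appName("diagnostic-hello-world").getOrCreate()

# A trivial DataFrame job: generate a small dataset, then group and aggregate it,
# which exercises a shuffle that the RAPIDS Accelerator can run on the GPU.
df = spark.range(0, 10000).selectExpr("id", "id % 10 as bucket")
df.groupBy("bucket").count().orderBy("bucket").show()

spark.stop()
```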

For deprecated_jar, the chance is low: for example, a customer leaves an old cudf jar in place alongside the latest spark-rapids jar without noticing it. It happened once, but I am fine with removing this function.
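For completeness, here is a minimal sketch of what such a check could do: scan the Spark jars directory for a standalone cudf jar sitting next to a rapids-4-spark jar. The function name, the directory path in the usage note, and the filename patterns are assumptions for illustration:

```python
from pathlib import Path

def find_deprecated_cudf_jars(jars_dir: str) -> list:
    """Return cudf jars that sit alongside a rapids-4-spark jar (not needed since 22.08)."""
    jars = list(Path(jars_dir).glob("*.jar"))
    has_rapids_jar = any(j.name.startswith("rapids-4-spark") for j in jars)
    cudf_jars = [j for j in jars if j.name.startswith("cudf")]
    # Only flag cudf jars when the RAPIDS plugin jar is also present, since the
    # standalone cudf jar stopped being required as of RAPIDS Accelerator 22.08.
    return cudf_jars if has_rapids_jar else []

# Hypothetical usage:
# leftovers = find_deprecated_cudf_jars("/usr/lib/spark/jars")
# if leftovers:
#     print("Deprecated cudf jar(s) found:", ", ".join(j.name for j in leftovers))
```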