Open · zhanga5 opened this issue 1 year ago
@zhanga5 , I thought #375 covered some of those items in the list. What am I missing?
- GPU hardware info (lshw, lspci)
- GPU Driver version
- Plugin version
- CUDA Runtime version, if present
- Spark version and configuration, if present
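For reference, a minimal sketch of how those facts could be gathered by shelling out to standard tools (hypothetical helper, not the actual #375 implementation):

```python
# Hypothetical sketch: collect the environment facts listed above.
# Not the actual #375 implementation.
import shutil
import subprocess

def collect(cmd: list[str]) -> str:
    # Report missing tools instead of failing the whole diagnostic.
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not available"
    try:
        # Merge stderr into stdout: some tools (e.g. spark-submit --version)
        # print their version banner to stderr.
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True, timeout=30)
        return result.stdout.strip()
    except subprocess.SubprocessError as exc:
        return f"{cmd[0]}: failed ({exc})"

report = {
    "gpu_hw": collect(["lspci"]),                      # GPU hardware info
    "gpu_driver": collect(["nvidia-smi", "--query-gpu=driver_version",
                           "--format=csv,noheader"]),  # GPU driver version
    "cuda": collect(["nvcc", "--version"]),            # CUDA runtime, if present
    "spark": collect(["spark-submit", "--version"]),   # Spark version, if present
}
```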
For the remaining ones:
* `perf`: I don't see how this will be part of the diagnostic. If it is not a standard benchmark, then it is unlikely to be accepted as a performance evaluation.
* `spark_job`: Is there a scenario in which a CSP cluster won't run a Spark hello-world? I mean, what are the cases that would cause this to fail?
* `perf_job`: same issue as `perf`.
* `deprecated_jar`: do we still need to check against that? It is supposed to have been deprecated for a long time by now. Do we still have users deploying `cudf.jar`?
Yeah, perhaps they're not all required at this moment. @viadea / @nvliyuan, could you guys help review these functions and update the previous requirements if possible?
CCing: @mattahrens and @mattf
Maybe @mattf could help answer the requirement questions?
> For the remaining ones:
> * `perf`: I don't see how this will be part of the diagnostic. If it is not a standard benchmark, then it is unlikely to be accepted as a performance evaluation.
> * `spark_job`: Is there a scenario in which a CSP cluster won't run a Spark hello-world? I mean, what are the cases that would cause this to fail?
> * `perf_job`: same issue as `perf`.
> * `deprecated_jar`: do we still need to check against that? It is supposed to have been deprecated for a long time by now. Do we still have users deploying `cudf.jar`?
For `perf` or `perf_job`, I think it is fine to remove them.
For `spark_job`, the intention is to make sure a sample DataFrame API job can run without any issue. The reasons for failure vary: no GPU resource discovered, misconfiguration in spark-defaults.conf, and so on.
I just want to make sure the cluster is healthy before the customer runs any other production jobs on it.
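For illustration, here is a minimal sketch of what such a smoke-test job could look like (hypothetical code, not what the pytools framework actually runs):

```python
# Hypothetical smoke test: run a trivial DataFrame API job end-to-end so
# that cluster-health problems (no GPU resource discovered, bad settings
# in spark-defaults.conf, etc.) surface before any production job runs.
from pyspark.sql import SparkSession

def spark_hello_world() -> bool:
    spark = SparkSession.builder.appName("diag-spark-job").getOrCreate()
    try:
        df = spark.range(0, 1000).selectExpr("id", "id * 2 AS doubled")
        # count() forces execution, so scheduling and configuration
        # failures show up here rather than at job-definition time.
        return df.count() == 1000
    finally:
        spark.stop()
```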
For `deprecated_jar`, the chance is low: for example, a customer deploys the old cudf jar alongside the latest spark-rapids jar without noticing it. It happened once, but I am fine with removing this function if we want.
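If we did keep it, the check could be as simple as scanning the jars directory; a hypothetical sketch (the path and jar naming pattern are assumptions):

```python
# Hypothetical check for the scenario above: a standalone cudf jar left
# next to the spark-rapids plugin jar. Recent plugin versions bundle cudf,
# so a separate cudf-*.jar usually means a stale deployment.
from pathlib import Path

def find_deprecated_cudf_jars(jars_dir: str = "/opt/spark/jars") -> list[Path]:
    return sorted(Path(jars_dir).glob("cudf-*.jar"))

if __name__ == "__main__":
    stale = find_deprecated_cudf_jars()
    if stale:
        print("WARNING: deprecated cudf jar(s) found:",
              ", ".join(str(p) for p in stale))
```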
**Is your feature request related to a problem? Please describe.**
The legacy diagnostic tool was removed by https://github.com/NVIDIA/spark-rapids-tools/pull/406; the following diagnostic functions are no longer supported:

**Describe the solution you'd like**
Port the legacy diagnostic functions to the new pytools framework.

**Describe alternatives you've considered**
Restore the legacy diagnostic tool that was removed by https://github.com/NVIDIA/spark-rapids-tools/pull/406.

**Additional context**
None.