NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[Bug] Fix java Qual tool handling of `--platform` argument #1161

Open cindyyuanjiang opened 6 days ago

cindyyuanjiang commented 6 days ago

Fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1028

A valid platform argument consists of two parts: a platform name and an optional GPU name. For example, databricks-aws-t4 combines the platform name databricks-aws with the GPU name t4.

This PR addresses the case where the platform name is valid but the GPU name may be invalid. An invalid platform name is not covered here, because the tool already detects it and raises an error.
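
To make the two-part structure concrete, here is a minimal sketch of splitting a `--platform` value into a platform name and an optional GPU name. It is not the tool's actual parsing code: `PlatformArgSketch`, `knownPlatforms`, `ParsedPlatform`, and `parse` are hypothetical names, and the platform list is illustrative only.

```scala
object PlatformArgSketch {
  // Illustrative platform names only; the real tool keeps its own list.
  val knownPlatforms: Seq[String] = Seq("databricks-aws", "dataproc")

  final case class ParsedPlatform(name: String, gpu: Option[String])

  // Platform names may contain hyphens, so match the longest known
  // platform prefix first and treat the remainder as the GPU name.
  def parse(arg: String): Option[ParsedPlatform] =
    knownPlatforms.sortBy(-_.length).collectFirst {
      case p if arg == p =>
        ParsedPlatform(p, None)
      case p if arg.startsWith(p + "-") && arg.length > p.length + 1 =>
        ParsedPlatform(p, Some(arg.substring(p.length + 1)))
    }
}
```

Under these assumptions, `PlatformArgSketch.parse("databricks-aws-r4")` yields the platform `databricks-aws` with GPU `r4`. Whether `r4` is a supported GPU is a separate check, which is the case this PR is about.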

Changes

This PR handles the following scenarios for the --platform argument:

1. Platform with unsupported GPU device, e.g. databricks-aws-r4

Before this PR: The implementation extracts r4 as the GPU device. Because r4 is not in the GPU device map, the tool silently uses databricks-aws as the platform and proceeds with running the Qual tool.

Stdout
24/07/02 17:08:58 INFO PlatformFactory: Using platform: databricks-aws
24/07/02 17:08:58 INFO PluginTypeChecker: Reading operators scores with platform: databricks-aws

After this PR: The tool raises an error about the unsupported GPU device and skips the rest of the processing.

Stdout
24/07/02 17:04:44 ERROR QualificationMain: Error creating the platform
java.lang.IllegalArgumentException: Unsupprted GPU device: r4
    at com.nvidia.spark.rapids.tool.PlatformFactory$.createInstance(Platform.scala:290)
    at com.nvidia.spark.rapids.tool.qualification.QualificationMain$.mainInternal(QualificationMain.scala:67)
    at com.nvidia.spark.rapids.tool.qualification.QualificationMain$.main(QualificationMain.scala:35)
    at com.nvidia.spark.rapids.tool.qualification.QualificationMain.main(QualificationMain.scala)
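
The new behavior in scenario 1 can be sketched as a validation step that rejects an unknown GPU name instead of silently dropping it. This is an assumption-laden illustration, not the code in `PlatformFactory.createInstance`: `GpuValidationSketch`, `supportedGpus`, and `validateGpu` are hypothetical, and the GPU set is illustrative.

```scala
object GpuValidationSketch {
  // Illustrative stand-in for the tool's GPU device map.
  val supportedGpus: Set[String] = Set("t4", "l4")

  // Returns the GPU name unchanged when it is supported (or absent),
  // and throws when a GPU name was given but is not recognized.
  def validateGpu(gpu: Option[String]): Option[String] = gpu.map { g =>
    if (!supportedGpus.contains(g.toLowerCase)) {
      throw new IllegalArgumentException(s"Unsupported GPU device: $g")
    }
    g
  }
}
```

With this shape, a caller such as the main entry point can catch the exception, log it, and stop before any further processing, which matches the behavior shown in the log above.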

2. Platform with supported GPU device, but the combination of Platform and GPU does not have existing speedup factor files, e.g. databricks-aws-l4.

Before this PR: The tool is unable to find the corresponding speedup factor file and fails with a NullPointerException.

Stdout
24/07/02 17:09:33 INFO PlatformFactory: Using platform: databricks-aws-l4
24/07/02 17:09:33 INFO PluginTypeChecker: Reading operators scores with platform: databricks-aws-l4
Exception in thread "main" java.lang.NullPointerException
    at scala.io.Source$.$anonfun$fromInputStream$2(Source.scala:172)
    at scala.io.Source.close(Source.scala:368)
    at com.nvidia.spark.rapids.tool.qualification.PluginTypeChecker.readOperators(PluginTypeChecker.scala:205)
    at com.nvidia.spark.rapids.tool.qualification.PluginTypeChecker.readOperatorsScore(PluginTypeChecker.scala:129)
    at com.nvidia.spark.rapids.tool.qualification.PluginTypeChecker.<init>(PluginTypeChecker.scala:100)
    at com.nvidia.spark.rapids.tool.qualification.QualificationMain$.mainInternal(QualificationMain.scala:77)
    at com.nvidia.spark.rapids.tool.qualification.QualificationMain$.main(QualificationMain.scala:35)
    at com.nvidia.spark.rapids.tool.qualification.QualificationMain.main(QualificationMain.scala)

After this PR: The tool logs a message that there is no speedup factor file for this platform and uses a default speedup factor file instead. For example, databricks-aws-l4 uses the databricks-aws-t4 file.

Stdout
24/07/02 17:06:49 INFO PluginTypeChecker: Reading operators scores with platform: databricks-aws-l4
24/07/02 17:06:49 WARN PluginTypeChecker: Unable to read operator scores from file: operatorsScore-databricks-aws-l4.csv
24/07/02 17:06:49 INFO PluginTypeChecker: Using default operator scores file: operatorsScore-databricks-aws-t4.csv
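
The fallback in scenario 2 can be sketched as follows. This is a simplified illustration, not the `PluginTypeChecker` implementation: `OperatorScoreFallbackSketch`, `scoreFileName`, and `resolveScoreFile` are hypothetical names, the `exists` parameter stands in for whatever resource lookup the tool performs, and `println` stands in for its logger. Only the file name pattern is taken from the log output above.

```scala
object OperatorScoreFallbackSketch {
  // File name pattern as it appears in the log output.
  def scoreFileName(platform: String): String =
    s"operatorsScore-$platform.csv"

  // Picks the requested platform's file when it exists; otherwise
  // reports the problem and returns the default platform's file.
  def resolveScoreFile(
      platform: String,
      defaultPlatform: String,
      exists: String => Boolean): String = {
    val requested = scoreFileName(platform)
    if (exists(requested)) {
      requested
    } else {
      val fallback = scoreFileName(defaultPlatform)
      println(s"WARN Unable to read operator scores from file: $requested")
      println(s"INFO Using default operator scores file: $fallback")
      fallback
    }
  }
}
```

Checking for the file before reading it avoids handing a missing resource to the reader, and the two messages make it clear to the user which file was actually used.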

cindyyuanjiang commented 6 days ago

I would like some input on Change No. 2 in the PR description: for a platform with a GPU device that does not have a speedup factor file, do we want the tool to fall back to a default speedup factor file? This somewhat contradicts the current implementation, because the tool is supposed to support only the platforms listed under getAllNames.

cc: @amahussein @tgravescs

tgravescs commented 6 days ago

After this PR: The tool will raise an error about the unsupported GPU device and skips the rest of processing.

What does the user see in the final output?

Platform with supported GPU device, but the combination of Platform and GPU does not have existing speedup factor files, e.g. databricks-aws-l4.

If we are switching to qualx, then we shouldn't be using speedup factors anyway, so to me that is lower priority. I think using what we have is fine, but make it obvious to the user that this is what happened. Ideally, we stop doing speedup factor calculations.

cindyyuanjiang commented 5 days ago

@tgravescs Thanks for the feedback! I have included the output for the different cases in the description.