Open cindyyuanjiang opened 6 days ago
I want to get some input on Change No. 2 in the PR description: for a platform with a GPU device that does not have a speedup factor file, do we want the tool to fall back to a default speedup factor file? This somewhat contradicts the current implementation, because the tool is supposed to support only platforms listed under getAllNames.
cc: @amahussein @tgravescs
After this PR: The tool will raise an error about the unsupported GPU device and skip the rest of processing.
What does the user see in the final output?
Platform with supported GPU device, but the combination of Platform and GPU does not have existing speedup factor files, e.g. databricks-aws-l4.
If we are switching to qualx, then we shouldn't be using speedup factors anyway, so to me this is lower priority. I think using what we have is fine, but make it obvious to the user that this is what happened. Ideally we stop doing speedup factor calculations.
@tgravescs Thanks for the feedback! I have included the output for different cases in description.
Fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1028
A valid platform argument consists of two parts: a `Platform` name and an optional `GPU` name. For example:

In this PR, we are discussing the case where the `Platform` name is valid but the `GPU` name may be corrupted, because otherwise the tool would already have detected it and raised an error.

Changes
This PR handles different scenarios of the input `--platform` argument:

1. Platform with unsupported GPU device, e.g. `databricks-aws-r4`.

Before this PR: The implementation extracts `r4` as the GPU device, but since it is not in the GPU device map, the tool uses `databricks-aws` as the platform to proceed with running the Qual tool.

Stdout

After this PR: The tool raises an error about the unsupported GPU device and skips the rest of processing.

Stdout
2. Platform with supported GPU device, but the combination of Platform and GPU does not have existing speedup factor files, e.g. `databricks-aws-l4`.

Before this PR: The tool is unable to find the corresponding speedup factor file and runs into a `NullPointerException`.

Stdout

After this PR: The tool prints a message that there is no speedup factor file for this platform and that it will use a default speedup factor file. E.g. `databricks-aws-l4` will use the `databricks-aws-t4` file.

Stdout
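
The two scenarios above can be summarized with a minimal Python sketch. This is not the tool's actual code: the lookup tables, the file names, the `UnsupportedGpuError` class, and the functions `parse_platform` and `resolve_speedup_file` are all hypothetical names chosen for illustration. It only mirrors the behavior described after this PR: an unsupported GPU name raises an error, and a supported GPU without a speedup factor file falls back to the platform's default file with a message.

```python
# Hypothetical lookup tables for illustration only; the real tool keeps its own.
KNOWN_PLATFORMS = {"databricks-aws"}
SUPPORTED_GPUS = {"t4", "l4", "a100"}
# (platform, gpu) pairs that have a speedup factor file (assumed file names).
SPEEDUP_FILES = {("databricks-aws", "t4"): "speedup-databricks-aws-t4.csv"}
# Default GPU per platform, used when no GPU is given or no file exists.
DEFAULT_GPU = {"databricks-aws": "t4"}


class UnsupportedGpuError(ValueError):
    """Raised when the GPU part of the platform argument is not supported."""


def parse_platform(arg):
    """Split a --platform value into (platform, gpu); gpu is None if omitted."""
    if arg in KNOWN_PLATFORMS:
        return arg, None
    platform, sep, gpu = arg.rpartition("-")
    if not sep or platform not in KNOWN_PLATFORMS:
        raise ValueError(f"Unsupported platform: {arg}")
    if gpu not in SUPPORTED_GPUS:
        # Scenario 1: valid platform, unsupported GPU -> stop processing.
        raise UnsupportedGpuError(f"Unsupported GPU device '{gpu}' in '{arg}'")
    return platform, gpu


def resolve_speedup_file(arg):
    """Return (speedup_file, message); message is None when no fallback is used."""
    platform, gpu = parse_platform(arg)
    gpu = gpu or DEFAULT_GPU[platform]
    if (platform, gpu) in SPEEDUP_FILES:
        return SPEEDUP_FILES[(platform, gpu)], None
    # Scenario 2: supported GPU without a speedup file -> use the default file.
    default_gpu = DEFAULT_GPU[platform]
    message = (
        f"No speedup factor file for {platform}-{gpu}; "
        f"using {platform}-{default_gpu} instead."
    )
    return SPEEDUP_FILES[(platform, default_gpu)], message
```

With these assumed tables, `resolve_speedup_file("databricks-aws-l4")` returns the `databricks-aws-t4` file together with a fallback message, while `resolve_speedup_file("databricks-aws-r4")` raises `UnsupportedGpuError`.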