NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Fix java Qual tool Autotuner output when GPU device is missing #1085

Closed cindyyuanjiang closed 4 weeks ago

cindyyuanjiang commented 4 weeks ago

Fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1030

Problem

The Java Qual tool Autotuner outputs inconsistent results when no GPU device is provided in the worker info file (--worker-info) and the platform is set to databricks-aws-t4.

Changes

Testing

export TOOLS_JAR
export EVENTLOGS
export WORKER_INFO_FILE
java -XX:+UseG1GC -Xmx50g \
  -cp "$TOOLS_JAR:$SPARK_HOME/jars/*" \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  --platform databricks-aws-t4 \
  --num-threads 6 \
  --auto-tuner \
  --worker-info "$WORKER_INFO_FILE" \
  $EVENTLOGS

In WORKER_INFO_FILE:

system:
  numCores: 32
  memory: 131072MiB
  numWorkers: 4
softwareProperties:
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m
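
The worker info file above has no GPU entry, which is the case this PR addresses. A pre-flight check along the following lines could flag that case before the tool is run. This is only a sketch: it assumes the GPU device is described under a top-level `gpu:` key in the worker info YAML, and `has_gpu_section` is a hypothetical helper, not part of spark-rapids-tools.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; not part of spark-rapids-tools.
# Assumption: GPU details live under a top-level `gpu:` key in the worker info YAML.
has_gpu_section() {
  local worker_info_file="$1"
  # Match an unindented `gpu:` key, allowing trailing spaces or a trailing comment.
  grep -Eq '^gpu:[[:space:]]*(#.*)?$' "$worker_info_file"
}
```

For example, running `has_gpu_section "$WORKER_INFO_FILE" || echo "no GPU device in worker info" >&2` before the java command would print the warning for the file shown here.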

tgravescs commented 4 weeks ago

thanks for fixing this, it would be nice to add the new fixed output to the description.

tgravescs commented 4 weeks ago

This is likely a separate issue, but it seems odd to me that the tool doesn't fail if we pass in a bad parameter for --platform. For instance, if I pass in databricks-aws-r4, it happily goes along and just changes it to something else:

24/06/07 07:43:57 INFO PlatformFactory: Using platform: databricks-aws

I would have expected this to error out so the user knows they are not going to get what they expect. @mattahrens @amahussein, thoughts on this? If you agree, we can file a separate issue.
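
The fail-fast behavior described above can be sketched as a check in a wrapper script. This is not how the tool's PlatformFactory works. `validate_platform` and `KNOWN_PLATFORMS` are hypothetical names, and the allow-list contains only the two platform names that appear in this thread, not the tool's real list of supported platforms.

```shell
#!/usr/bin/env bash
# Illustrative allow-list only: the two platform names mentioned in this thread.
# The real set of supported platforms is defined by the tool, not by this script.
KNOWN_PLATFORMS="databricks-aws databricks-aws-t4"

# Hypothetical helper: succeed only when the requested platform matches exactly,
# rather than silently falling back to a different platform.
validate_platform() {
  local requested="$1" known
  for known in $KNOWN_PLATFORMS; do
    if [ "$requested" = "$known" ]; then
      return 0
    fi
  done
  echo "ERROR: unknown platform '$requested' (known: $KNOWN_PLATFORMS)" >&2
  return 1
}
```

With this check, `validate_platform databricks-aws-r4` returns a non-zero status and prints an error instead of continuing with databricks-aws.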

amahussein commented 4 weeks ago

Yes, we can file a separate issue for that.