NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0
49 stars 36 forks source link

Recommended cluster should use executors_per_node and cores_per_executor #1138

Closed amahussein closed 1 month ago

amahussein commented 3 months ago

Signed-off-by: Ahmed Hussein (amahussein) a@ahussein.me

This is a follow-up on #1119 and Fixes #1133

This change address a bug with the Dataproc recommended cluster shape. To reproduce the use case, there should be a single eventlog and no --cluster argument. Example:

spark_rapids qualification \
  --platform dataproc \
  --output_folder ~/output_folder \
  --eventlogs ~/eventlogs/app-xyz \
  --verbose

The instance type will be set after multiplying numExecutorCores by numExecutorsPerNode. If the result does not match a valid machine-type, then the number-cores will be adjusted to pick the next valid higher number of cores. The following log message indicates that the number of cores was adjusted.

2024-06-26 10:56:57,709 INFO rapids.tools.csp: The number of cores 80 is not in the list of supported cores. Adjusting it to the nearest CSP machine with higher number of cores n1-standard-96
tgravescs commented 3 months ago

This fix only works for Dataproc when the cluster argument is not in the CLI. Please explain more

amahussein commented 3 months ago

This fix only works for Dataproc when the cluster argument is not in the CLI. Please explain more

Thanks @tgravescs I updated the PR description

tgravescs commented 3 months ago

so I was thinking some more about this and I actually dont' think this is what we want in all cases. This would be platform dependent... meaning this might be fine on dataproc as long as the hosts we recommend can have multiple GPUs, but on like AWS we would actually want to recommend having more nodes (assuming each one has 1 gpu) in order to keep the number of executors the same. Now this has ramifications on the cost, but I think the initial thing is to keep the performance about equivalent and if we just say get node with twice the cores but only 1 GPU is going to affect our parallelism.

amahussein commented 3 months ago

@tgravescs Are you fine with the latest changes in this PR?

amahussein commented 3 months ago

so I was thinking some more about this and I actually dont' think this is what we want in all cases. This would be platform dependent... meaning this might be fine on dataproc as long as the hosts we recommend can have multiple GPUs, but on like AWS we would actually want to recommend having more nodes (assuming each one has 1 gpu) in order to keep the number of executors the same. Now this has ramifications on the cost, but I think the initial thing is to keep the performance about equivalent and if we just say get node with twice the cores but only 1 GPU is going to affect our parallelism.

Yes, that's what I was thinking about at the beginning when we discussed it. The fix is related to the strategy we take to map the CPU cluster to GPU cluster.