NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Qualification tool should recommend the cluster shape based on the best TCO according to our internal benchmark #1109

Open viadea opened 3 weeks ago

viadea commented 3 weeks ago

Is your feature request related to a problem? Please describe. I wish the Qualification tool would recommend the cluster shape based on the best TCO according to our internal benchmarks. Currently the model picks a GPU instance with the same CPU core count after checking the CPU cluster shape.

Sometimes this is obviously wrong or not the best instance. For example, in a Dataproc CPU event log with spark.executor.cores=2, the Qualification tool will recommend n1-standard-2 workers for the GPU cluster.
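
For illustration, a minimal sketch of the same-core mapping described above; the helper and the shape list here are hypothetical, not the tool's actual code:

```python
# Hypothetical sketch of the same-core mapping, not the tool's code.
# It picks the smallest n1-standard worker whose core count covers
# spark.executor.cores, which is how spark.executor.cores=2 turns into
# an n1-standard-2 recommendation.

N1_STANDARD_CORES = [2, 4, 8, 16, 32, 64, 96]  # available n1-standard sizes

def naive_gpu_worker(executor_cores: int) -> str:
    cores = min((c for c in N1_STANDARD_CORES if c >= executor_cores),
                default=N1_STANDARD_CORES[-1])
    return f"n1-standard-{cores}"

print(naive_gpu_worker(2))  # -> n1-standard-2, regardless of the real host size
```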

Describe the solution you'd like Based on the best-TCO NDS benchmark runs, the Qualification tool should choose the best worker type. In the Dataproc case that should be either n1-standard-32 with 2x T4 GPUs each or g2-standard-16 with 1 L4.
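
A rough sketch of the idea, assuming per-shape benchmark speedups and hourly prices are available; the class, prices, and speedup numbers below are placeholders, not real benchmark or pricing data:

```python
# Placeholder sketch of TCO-based worker selection: cost per workload is roughly
# proportional to hourly price divided by benchmark speedup, so the shape with
# the lowest ratio wins even if its hourly price is higher.
from dataclasses import dataclass

@dataclass
class GpuWorkerCandidate:
    name: str
    hourly_price: float  # node $/hour (placeholder values, not real pricing)
    nds_speedup: float   # relative NDS speedup vs the CPU baseline (placeholder)

def best_tco_worker(candidates: list[GpuWorkerCandidate]) -> GpuWorkerCandidate:
    # Lower price/speedup means a cheaper total run for the same workload.
    return min(candidates, key=lambda c: c.hourly_price / c.nds_speedup)

candidates = [
    GpuWorkerCandidate("n1-standard-32 + 2x T4", hourly_price=2.3, nds_speedup=3.0),
    GpuWorkerCandidate("g2-standard-16 + 1x L4", hourly_price=1.9, nds_speedup=2.8),
]
print(best_tco_worker(candidates).name)
```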

### Tasks
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/1117
- [ ] https://github.com/NVIDIA/spark-rapids-tools/pull/1138
- [ ] User-tools workflow to get estimate and recommendation per-app
tgravescs commented 2 weeks ago

This gets really complicated. How do we know this is the best TCO? I have helped a customer configure jobs on Databricks AWS where going to a larger node size (which our internal benchmarks say is most performant) was much more costly than using smaller nodes. Do you have specific cases you can share, or event logs that show using the larger node really is cost effective?

tgravescs commented 2 weeks ago

Note there are 2 issues here.

1) The Qualification tool is recommending n1-standard-2 when the hosts were really n1-standard-16 or n1-standard-32. This is because we don't track the number of executors per node. We should definitely try to fix that, though there are some corner cases on YARN that get difficult.
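
A minimal sketch of how executors-per-host could be recovered from the event log to approximate the real node size; the function is hypothetical and deliberately glosses over the YARN corner cases mentioned above:

```python
# Hypothetical sketch: approximate the real node size by counting executors per
# host in the Spark event log and multiplying by spark.executor.cores.
import json
from collections import Counter

def estimate_host_cores(eventlog_path: str, executor_cores: int) -> int:
    executors_per_host = Counter()
    with open(eventlog_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerExecutorAdded":
                executors_per_host[event["Executor Info"]["Host"]] += 1
    if not executors_per_host:
        return executor_cores
    # Use the busiest host; with dynamic allocation this can overcount executors
    # that were never alive at the same time.
    return max(executors_per_host.values()) * executor_cores

# e.g. 8 executors observed on one host with spark.executor.cores=2 -> ~16 cores,
# pointing at n1-standard-16 rather than n1-standard-2.
```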

tgravescs commented 4 days ago

https://github.com/NVIDIA/spark-rapids-tools/pull/1138 is working on number 1 above.