viadea opened this issue 3 weeks ago
This gets really complicated. How do we know this is the best TCO? I have helped a customer configure jobs on Databricks AWS where going to the larger node size (which our internal benchmarks say is most performant) was much more costly than using the smaller nodes. Do you have specific cases you can share, or event logs that show using the larger node really is cost effective?
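For context, one rough way to frame the question: a larger node only lowers TCO if the speedup it delivers exceeds its price ratio versus the smaller node. A minimal sketch of that break-even arithmetic, using made-up hourly prices purely for illustration:

```python
# Break-even check: the larger node is only cheaper per job if the speedup it
# delivers exceeds its price ratio versus the smaller node.
# Prices and speedup below are placeholders, not real cloud list prices.
small_price_per_hour = 1.0   # hypothetical cost of the smaller node
large_price_per_hour = 2.2   # hypothetical cost of the larger node
speedup = 1.8                # runtime(small node) / runtime(large node)

price_ratio = large_price_per_hour / small_price_per_hour
if speedup > price_ratio:
    print("larger node is cheaper per job")
else:
    print(f"larger node costs more: need speedup > {price_ratio:.2f}x to break even")
```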
Note there are 2 issues here.
1) The qualification tool is recommending n1-standard-2 when the hosts were really n1-standard-16 or n1-standard-32. This is because we don't track the number of executors per node. We should definitely try to fix that. There are some corner cases on YARN that get difficult:
YARN allows different schedulers, and not all of them use all resources when scheduling. So even though the Spark app may ask for 16 cores, YARN doesn't necessarily honor that, and you could end up oversubscribing. We have no way of knowing that, so I think for now we have to document this.
2) Based on the best-TCO NDS benchmark run, the qualification tool should choose the best workers.
Issue #1117 is a start at fixing the first issue, but we need more changes on the Python side to do that. The second issue needs more data and heuristics for us to do properly.
https://github.com/NVIDIA/spark-rapids-tools/pull/1138 is working on number 1 above.
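For issue 1, a rough sketch of how executors-per-node might be inferred from the event log itself, since SparkListenerExecutorAdded events carry the executor's host. This is illustrative only, not what the tools currently implement, and it assumes an uncompressed single-file event log:

```python
import json
from collections import defaultdict

def executors_per_host(eventlog_path):
    """Count distinct executors seen on each host in a Spark event log.

    Field names ("Event", "Executor ID", "Executor Info", "Host") follow
    Spark's standard event-log JSON. With dynamic allocation this is only
    an approximation of how densely executors were packed per node.
    """
    hosts = defaultdict(set)
    with open(eventlog_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            if event.get("Event") == "SparkListenerExecutorAdded":
                host = event["Executor Info"]["Host"]
                hosts[host].add(event["Executor ID"])
    return {host: len(execs) for host, execs in hosts.items()}

# Example: the max executors observed on any single host gives a lower bound
# on executor density, which could scale the recommended node size.
# density = max(executors_per_host("app-eventlog").values())
```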
Is your feature request related to a problem? Please describe.
I wish the Qualification tool would recommend the cluster shape based on the best TCO according to our internal benchmarks. Currently the model uses the same-CPU-core GPU instance after checking the CPU cluster shape.
Sometimes it is obviously wrong or not the best instance. For example, in a Dataproc CPU event log with spark.executor.cores=2, the Qualification tool will recommend
n1-standard-2
as workers in the GPU cluster.

Describe the solution you'd like
Based on the best-TCO NDS benchmark run, the Qualification tool should choose the best workers. In the Dataproc case it should be either
n1-standard-32
with 2x T4s each, or
g2-standard-16
with 1 L4.
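A minimal sketch of what that best-TCO selection could look like, assuming we had per-candidate speedup estimates (e.g. calibrated from NDS benchmark runs) and hourly prices. The instance names, prices, and speedups below are placeholders, not benchmark results:

```python
# Hypothetical GPU worker candidates with placeholder prices and estimated
# speedups over the CPU baseline; none of these numbers are real benchmarks.
candidates = [
    {"name": "n1-standard-32 + 2x T4", "price_per_hour": 3.0, "speedup": 3.5},
    {"name": "g2-standard-16 + 1x L4", "price_per_hour": 2.5, "speedup": 3.2},
]
cpu_baseline = {"price_per_hour": 1.5, "runtime_hours": 4.0}

def estimated_cost(candidate):
    # Estimated GPU runtime = CPU runtime / speedup; cost = runtime * price.
    gpu_runtime = cpu_baseline["runtime_hours"] / candidate["speedup"]
    return gpu_runtime * candidate["price_per_hour"]

best = min(candidates, key=estimated_cost)
print(f"best TCO worker: {best['name']} at ~${estimated_cost(best):.2f} per run")
```

The key design question is where the per-candidate speedup estimates come from; that is the "more data and heuristics" gap called out above.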