viadea opened this issue 3 weeks ago
This gets really complicated. How do we know this is the best TCO? I have helped a customer configure jobs on Databricks AWS where going to the larger node size (which our internal benchmarks say is most performant) was much more costly than using the smaller nodes. Do you have specific cases you can share, or event logs that show using the larger node really is cost effective?
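For context, one rough way to frame the question: a larger node only lowers TCO if the speedup it delivers exceeds its price ratio versus the smaller node. A minimal sketch of that break-even arithmetic, using made-up hourly prices purely for illustration:

```python
# Break-even check: the larger node is only cheaper per job if the speedup it
# delivers exceeds its price ratio versus the smaller node.
# Prices and speedup below are placeholders, not real cloud list prices.
small_price_per_hour = 1.0   # hypothetical cost of the smaller node
large_price_per_hour = 2.2   # hypothetical cost of the larger node
speedup = 1.8                # runtime(small node) / runtime(large node)

price_ratio = large_price_per_hour / small_price_per_hour
if speedup > price_ratio:
    print("larger node is cheaper per job")
else:
    print(f"larger node costs more: need speedup > {price_ratio:.2f}x to break even")
```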
Note there are 2 issues here.
1) The qualification tool is recommending n1-standard-2 when the hosts were really n1-standard-16 or n1-standard-32. This is because we don't track the number of executors per node. We should definitely try to fix that. There are some corner cases on YARN that get difficult:
YARN allows different schedulers, and not all of them use all resources when scheduling. So even though the Spark app may ask for 16 cores, YARN doesn't necessarily honor that, and you could end up oversubscribing. We have no way of knowing that, so I think for now we have to document this.
2) Based on the best-TCO NDS benchmark run, the qualification tool should choose the best workers.
Issue #1117 is a start at fixing the first issue, but we need more changes on the Python side to do that. The second issue needs more data and heuristics for us to do properly.
https://github.com/NVIDIA/spark-rapids-tools/pull/1138 is working on number 1 above.
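For issue 1, a rough sketch of how executors-per-node might be inferred from the event log itself, since SparkListenerExecutorAdded events carry the executor's host. This is illustrative only, not what the tools currently implement, and it assumes an uncompressed single-file event log:

```python
import json
from collections import defaultdict

def executors_per_host(eventlog_path):
    """Count distinct executors seen on each host in a Spark event log.

    Field names ("Event", "Executor ID", "Executor Info", "Host") follow
    Spark's standard event-log JSON. With dynamic allocation this is only
    an approximation of how densely executors were packed per node.
    """
    hosts = defaultdict(set)
    with open(eventlog_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            if event.get("Event") == "SparkListenerExecutorAdded":
                host = event["Executor Info"]["Host"]
                hosts[host].add(event["Executor ID"])
    return {host: len(execs) for host, execs in hosts.items()}

# Example: the max executors observed on any single host gives a lower bound
# on executor density, which could scale the recommended node size.
# density = max(executors_per_host("app-eventlog").values())
```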
Is your feature request related to a problem? Please describe.
I wish the Qualification tool would recommend the cluster shape based on the best TCO according to our internal benchmarks. Currently the model uses the same-CPU-core GPU instance after checking the CPU cluster shape.
Sometimes it is obviously wrong or not the best instance. For example, in a Dataproc CPU event log with spark.executor.cores=2, the Qualification tool will recommend
n1-standard-2
as workers in the GPU cluster.

Describe the solution you'd like
Based on the best-TCO NDS benchmark run, the Qualification tool should choose the best workers. In the Dataproc case it should be either
n1-standard-32
with 2x T4s each, or
g2-standard-16
with 1 L4.
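A minimal sketch of what that best-TCO selection could look like, assuming we had per-candidate speedup estimates (e.g. calibrated from NDS benchmark runs) and hourly prices. The instance names, prices, and speedups below are placeholders, not benchmark results:

```python
# Hypothetical GPU worker candidates with placeholder prices and estimated
# speedups over the CPU baseline; none of these numbers are real benchmarks.
candidates = [
    {"name": "n1-standard-32 + 2x T4", "price_per_hour": 3.0, "speedup": 3.5},
    {"name": "g2-standard-16 + 1x L4", "price_per_hour": 2.5, "speedup": 3.2},
]
cpu_baseline = {"price_per_hour": 1.5, "runtime_hours": 4.0}

def estimated_cost(candidate):
    # Estimated GPU runtime = CPU runtime / speedup; cost = runtime * price.
    gpu_runtime = cpu_baseline["runtime_hours"] / candidate["speedup"]
    return gpu_runtime * candidate["price_per_hour"]

best = min(candidates, key=estimated_cost)
print(f"best TCO worker: {best['name']} at ~${estimated_cost(best):.2f} per run")
```

The key design question is where the per-candidate speedup estimates come from; that is the "more data and heuristics" gap called out above.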