Closed tgravescs closed 4 weeks ago
--worker-info ./worker_info-demo-gpu-cluster.yaml
Should we add that file to the repo? Perhaps inside tests/resources?
Sure, I can add it.
I also realized I wanted to add a few more tests to the Suite so I'll do that and push some updates shortly.
fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1068
This enhances the heuristics around spark.executor.memory and handles cases where the memory-to-core ratio is too small. If the ratio is too small, it throws an exception and does not put out tunings. In the future we should just tag this and recommend the sizes.
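A minimal sketch of what such a ratio check might look like; the function name, exception, and the 2 GB-per-core threshold are illustrative assumptions, not the tool's actual values:

```python
# Hypothetical sketch of a memory-to-core ratio check. The threshold
# below is an assumed value for illustration only.
MIN_MEMORY_PER_CORE_MB = 2 * 1024

class NotEnoughMemoryError(Exception):
    """Raised when the memory-to-core ratio is too small to emit tunings."""

def check_memory_per_core(worker_memory_mb: int, num_cores: int) -> int:
    """Return the memory available per core, or raise instead of tuning."""
    per_core = worker_memory_mb // num_cores
    if per_core < MIN_MEMORY_PER_CORE_MB:
        raise NotEnoughMemoryError(
            f"memory/core ratio of {per_core} MB is below the "
            f"{MIN_MEMORY_PER_CORE_MB} MB minimum; skipping tunings")
    return per_core
```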
This also adds in extra overhead since, in the worst case, we need space for both pinned memory and spill memory. It gets a little complicated since spill will use pinned memory, but if that is already in use it will fall back to regular off-heap. So here we set things for the worst case, which is needing both.
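The worst-case reservation described above can be sketched as a simple sum; the function and parameter names are illustrative assumptions:

```python
# Illustrative worst-case overhead: reserve room for pinned memory and
# spill memory on top of the base overhead, since spill may fall back
# to regular off-heap when pinned memory is already in use.
def worst_case_overhead_mb(base_overhead_mb: int,
                           pinned_mb: int,
                           spill_mb: int) -> int:
    return base_overhead_mb + pinned_mb + spill_mb
```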
I also added heuristics for configuring the multithreaded readers (number of threads and some sizes) and the shuffle reader/writer thread pools, based on the number of cores.
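A hedged sketch of deriving those thread-pool settings from the core count; the config keys follow my reading of the RAPIDS plugin docs, and the multipliers are illustrative assumptions rather than the PR's actual heuristics:

```python
# Hypothetical sketch: size the multithreaded reader and shuffle
# thread pools from the executor core count. Multipliers and the
# floor of 20 threads are assumptions for illustration.
def thread_pool_tunings(num_cores: int) -> dict:
    return {
        "spark.rapids.sql.multiThreadedRead.numThreads": max(20, num_cores * 2),
        "spark.rapids.shuffle.multiThreaded.reader.threads": num_cores,
        "spark.rapids.shuffle.multiThreaded.writer.threads": num_cores,
    }
```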
Most of the heuristics are based on what we saw from real customer workloads and NDS results.
Most of this testing was on CSPs; I will try to apply more of it to on-prem later.
Note: most of this functionality requires the worker information to be passed in:
--worker-info ./worker_info-demo-gpu-cluster.yaml
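For reference, a hypothetical worker-info file (field names follow the shape described in the AutoTuner docs; all values here are made up for illustration) might look like:

```yaml
system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
softwareProperties:
  spark.executor.cores: '16'
  spark.executor.memory: 47222m
```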
Example:
With the worker info:
Without the worker info: