NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Qual tool tuning rec based on CPU event log coherently recommend tunings and node setup and infer cluster from eventlog #1160

Closed tgravescs closed 2 months ago

tgravescs commented 3 months ago

Is your feature request related to a problem? Please describe. Currently we have basic integration to run the qualification auto tuner against CPU event logs, but it isn't tied to any node recommendations or other features in the Python user tools, so users can get incorrect or inconsistent behavior.

Note, the assumption here is that you run a GPU cluster of similar size to the CPU cluster. Similarly sized to me means the same number of executors with similar cores/memory per executor (might not be exact). If you change the number of executors, you change the parallelism possibilities.
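To make the "similar sized" notion concrete, here is a minimal sketch of the kind of comparison meant; the `ClusterShape` type, the `is_similar` helper, and the 25% tolerance are all hypothetical illustrations, not part of the tools:

```python
from dataclasses import dataclass

@dataclass
class ClusterShape:
    """Hypothetical summary of a cluster's executor layout."""
    num_executors: int
    cores_per_executor: int
    memory_per_executor_gb: float

def is_similar(cpu: ClusterShape, gpu: ClusterShape, tolerance: float = 0.25) -> bool:
    """Same executor count; cores and memory within a loose tolerance."""
    if cpu.num_executors != gpu.num_executors:
        # Changing the executor count changes the parallelism possibilities.
        return False
    cores_ok = abs(cpu.cores_per_executor - gpu.cores_per_executor) <= tolerance * cpu.cores_per_executor
    mem_ok = abs(cpu.memory_per_executor_gb - gpu.memory_per_executor_gb) <= tolerance * cpu.memory_per_executor_gb
    return cores_ok and mem_ok
```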

We need to:

  1. Tie together the options from Python for passing clusters.
  2. If the cluster isn't provided, infer the CPU cluster setup from the event log and infer the GPU cluster from that (a rough sketch of the event-log scan is after this list). To make this happen we need a mapping to instance types in the Scala code.
  3. Make sure the memory recommendations are really based on the instance type, so that heap, off-heap, etc. will all fit (see the fit-check sketch after this list).
  4. Make sure we are recommending the correct number of nodes based on the instance type. For instance, CPU might run 2 executors on a 32-core box, but for that CSP there are no GPU nodes that can have 2 GPUs, so you have to make that 2 nodes of 16 cores each with 1 GPU to keep the number of executors the same (see the node-mapping sketch after this list).
  5. Output on the Python side should be a per-app recommendation for the cluster setup and tunings, unless the --cluster option is passed into the Python user_tools. If --cluster is passed, it is assumed each job ran on that type of cluster (but that path still has bugs, in my opinion).
  6. Tie the tunings and node recommendations inferred by the Scala code back into the final Python output in the cluster shape recommendation.
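For item 2, a minimal sketch of what inferring the CPU cluster from an event log could look like, assuming a plain-text (uncompressed, non-rolled) log; `infer_cluster_shape` is a hypothetical name, but the `SparkListenerExecutorAdded` event and its `Executor Info` fields are standard Spark event-log content:

```python
import json

def infer_cluster_shape(event_log_path: str) -> dict:
    """Count executors, hosts, and cores from executor-added events."""
    executor_cores = {}
    hosts = set()
    with open(event_log_path) as log:
        for line in log:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerExecutorAdded":
                info = event["Executor Info"]
                executor_cores[event["Executor ID"]] = info["Total Cores"]
                hosts.add(info["Host"])
    return {
        "num_executors": len(executor_cores),
        "num_hosts": len(hosts),
        "cores_per_executor": max(executor_cores.values(), default=0),
    }
```

An instance-type mapping would then turn that shape (plus per-node memory from the Spark configs) into a candidate CSP instance type.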
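For item 3, a sketch of the kind of fit check intended; the function name, parameters, and the OS reserve value are illustrative assumptions, not the tool's actual API:

```python
def memory_fits(instance_memory_gb: float,
                executors_per_node: int,
                heap_gb: float,
                offheap_gb: float,
                overhead_gb: float,
                os_reserve_gb: float = 4.0) -> bool:
    """Check that heap + off-heap + overhead for every executor on a node
    fits inside the instance's physical memory, minus an assumed reserve
    for the OS and node daemons."""
    per_executor_gb = heap_gb + offheap_gb + overhead_gb
    return executors_per_node * per_executor_gb <= instance_memory_gb - os_reserve_gb
```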
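And for item 4, a sketch of the node-count mapping using the 32-core example above; the catalog entries are made up for illustration (the point is that this CSP has no 2-GPU instance, so the executor count is preserved by doubling the node count):

```python
from dataclasses import dataclass

@dataclass
class GpuInstance:
    name: str
    cores: int
    gpus: int

# Hypothetical catalog for one CSP: only 1-GPU instances are available.
CATALOG = [
    GpuInstance("gpu.16core.1gpu", cores=16, gpus=1),
    GpuInstance("gpu.32core.1gpu", cores=32, gpus=1),
]

def recommend_nodes(cpu_nodes: int, cpu_cores_per_node: int, executors_per_node: int):
    """Keep the total executor count constant, one executor per GPU."""
    total_executors = cpu_nodes * executors_per_node
    cores_per_executor = cpu_cores_per_node // executors_per_node
    # Prefer the instance whose cores-per-GPU best matches the CPU executor.
    best = min(CATALOG, key=lambda i: abs(i.cores // i.gpus - cores_per_executor))
    gpu_nodes = -(-total_executors // best.gpus)  # ceiling division
    return best, gpu_nodes

# The example from item 4: 2 executors on one 32-core CPU box maps to
# 2 x 16-core nodes with 1 GPU each, keeping 2 executors total.
instance, nodes = recommend_nodes(cpu_nodes=1, cpu_cores_per_node=32, executors_per_node=2)
print(instance.name, nodes)  # gpu.16core.1gpu 2
```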