Include number of executors per node in cluster information

parthosa commented 3 months ago

Fixes #1117. For cluster information, we currently calculate the number of nodes correctly but do not track the number of executors per node. This can produce wrong GPU cluster recommendations, because a single node can host multiple executors.

This PR adds numExecsPerNode to the cluster information output file.

Changes:

Core/Java:

Calculate numExecsPerNode as the maximum number of executors on any host.
Update ClusterInfo and related methods.
Add Num Executor Per Node as a new field in the cluster information output CSV file.
Update the unit test to include a case with multiple executors on a single node.
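
The first change above can be sketched as follows. This is a minimal illustration in Python, not the code in this PR: the function name and the executor-to-host mapping are hypothetical, and it counts every executor seen for a host without considering whether executors were alive at the same time.

```python
from collections import Counter


def num_execs_per_node(executor_hosts):
    """Return the largest number of executors seen on any single host.

    executor_hosts is a hypothetical mapping of executor id -> host name.
    """
    counts = Counter(executor_hosts.values())
    # With no executors, report 0 instead of letting max() fail on an empty input.
    return max(counts.values(), default=0)


executors = {"1": "host-a", "2": "host-a", "3": "host-b"}
print(num_execs_per_node(executors))  # 2
```

Taking the maximum rather than the average keeps the value meaningful when executors are spread unevenly across hosts.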

Output

Cluster Information Generated from Core:

File: rapids_4_spark_qualification_output_cluster_information.json

Previously:

[ {
  "appName" : "NDS - Power Run",
  "appId" : "application_169685947xxxxx",
  "eventLogPath" : "file:/Users/psarthi/Work/event-logs/xxxxxx",
  "clusterInfo" : {
    "vendor" : "dataproc",
    "coresPerExecutor" : 16,
    "numExecutorNodes" : 4,
    "driverHost" : "xxxx-dataproc-cpu-m.c.xxxx.internal"
  }
} ]

After this fix:

[ {
  "appName" : "NDS - Power Run",
  "appId" : "application_169685947xxxxx",
  "eventLogPath" : "file:/Users/psarthi/Work/event-logs/xxxxxx",
  "clusterInfo" : {
    "vendor" : "dataproc",
    "coresPerExecutor" : 16,
    "numExecsPerNode" : 6,
    "numExecutorNodes" : 4,
    "driverHost" : "xxxx-dataproc-cpu-m.c.xxxx.internal"
  }
} ]
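
A consumer of this file might read the new field as sketched below. This is only an illustration: the sample is a trimmed copy of the output above embedded as a string, and the fallback to None for files written before this change is an assumption about how a reader may want to handle older output.

```python
import json

# Trimmed copy of the output shown above, embedded so the sketch is self-contained.
sample = """
[ {
  "appName" : "NDS - Power Run",
  "clusterInfo" : {
    "vendor" : "dataproc",
    "coresPerExecutor" : 16,
    "numExecsPerNode" : 6,
    "numExecutorNodes" : 4
  }
} ]
"""

records = json.loads(sample)
for record in records:
    info = record["clusterInfo"]
    # Files written before this change have no numExecsPerNode key, so use .get().
    execs_per_node = info.get("numExecsPerNode")
    print(record["appName"], info["numExecutorNodes"], execs_per_node)
```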

Follow Up

Need to investigate heuristics for calculating the total CPU cores of a node.
We cannot simply compute coresPerNode = numExecsPerNode * coresPerExecutor, since a node may be oversubscribed.
Original issue: #1109

parthosa commented 3 months ago

Do we have any tests for the dynamic allocation case you had in the description?

Added a unit test for dynamic allocation, with a comment.

tgravescs commented 3 months ago

We cannot perform coresPerNode = numExecsPerNode * coresPerExecutor since a node may be oversubscribed.

There is no way for us to know this on some platforms, such as YARN, where users select which resources the scheduler schedules by. For now we will just have to assume the node isn't oversubscribed, but document that assumption and warn the user.
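
One way to express that assumption is sketched below in Python. The function name and warning text are hypothetical and not part of the tool; the sketch only shows the estimate being computed together with an explicit warning that it may be too high.

```python
import warnings


def estimate_cores_per_node(num_execs_per_node, cores_per_executor):
    """Estimate node cores, assuming the node is not oversubscribed."""
    # The event log does not tell us whether the platform oversubscribes cores,
    # so surface the assumption to the user instead of hiding it.
    warnings.warn(
        "coresPerNode is estimated as numExecsPerNode * coresPerExecutor and "
        "assumes the node is not oversubscribed",
        UserWarning,
        stacklevel=2,
    )
    return num_execs_per_node * cores_per_executor


# Values taken from the example output in the description.
print(estimate_cores_per_node(6, 16))  # 96
```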

NVIDIA / spark-rapids-tools