NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Include number of executors per node in cluster information #1119

Closed parthosa closed 3 months ago

parthosa commented 3 months ago

Fixes #1117. Currently, the cluster information reports the number of nodes correctly but does not track the number of executors per node. Because a node can host multiple executors, this can produce wrong GPU cluster recommendations.

This PR adds numExecsPerNode to the cluster information output file.

Changes:

Core/Java:

Output

Cluster Information Generated from Core:

File: rapids_4_spark_qualification_output_cluster_information.json

Previously:

[ {
  "appName" : "NDS - Power Run",
  "appId" : "application_169685947xxxxx",
  "eventLogPath" : "file:/Users/psarthi/Work/event-logs/xxxxxx",
  "clusterInfo" : {
    "vendor" : "dataproc",
    "coresPerExecutor" : 16,
    "numExecutorNodes" : 4,
    "driverHost" : "xxxx-dataproc-cpu-m.c.xxxx.internal"
  }
} ]

After this fix:

[ {
  "appName" : "NDS - Power Run",
  "appId" : "application_169685947xxxxx",
  "eventLogPath" : "file:/Users/psarthi/Work/event-logs/xxxxxx",
  "clusterInfo" : {
    "vendor" : "dataproc",
    "coresPerExecutor" : 16,
    "numExecsPerNode" : 6,
    "numExecutorNodes" : 4,
    "driverHost" : "xxxx-dataproc-cpu-m.c.xxxx.internal"
  }
} ]
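
As a rough illustration of how a consumer of this file might use the new field, the sketch below parses a trimmed copy of the output above and estimates the total executor count. The helper name `summarize_cluster_info` is hypothetical and not part of the tools, and the estimate assumes every executor node hosts the same number of executors. Files produced before this change lack numExecsPerNode, so the sketch returns None in that case.

```python
import json

# Trimmed copy of the "After this fix" output shown above.
SAMPLE = """
[ {
  "appName" : "NDS - Power Run",
  "clusterInfo" : {
    "vendor" : "dataproc",
    "coresPerExecutor" : 16,
    "numExecsPerNode" : 6,
    "numExecutorNodes" : 4
  }
} ]
"""


def summarize_cluster_info(raw_json):
    """Hypothetical helper: return (appName, estimated total executors) per app.

    The estimate assumes each executor node runs numExecsPerNode executors.
    It is None when either field is missing, e.g. in output written before
    numExecsPerNode was added.
    """
    summaries = []
    for app in json.loads(raw_json):
        info = app.get("clusterInfo", {})
        execs_per_node = info.get("numExecsPerNode")
        nodes = info.get("numExecutorNodes")
        if execs_per_node is None or nodes is None:
            total = None
        else:
            total = execs_per_node * nodes
        summaries.append((app.get("appName"), total))
    return summaries


print(summarize_cluster_info(SAMPLE))  # [('NDS - Power Run', 24)]
```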

Follow Up

parthosa commented 3 months ago

Do we have any tests for the dynamic allocation case you had in the description?

Added a unit test for dynamic allocation, with a comment.

tgravescs commented 3 months ago

We cannot perform coresPerNode = numExecsPerNode * coresPerExecutor since a node may be oversubscribed.

There is no way for us to know this on some platforms, such as YARN, where the cluster chooses what it schedules by. For now we will just have to assume that nodes are not oversubscribed, but document this and warn the user.
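
A minimal sketch of the approach described in this comment might look like the following: compute the product anyway, but surface the assumption to the user. The function name `estimate_cores_per_node` and the warning text are hypothetical and do not reflect how the tools actually report this.

```python
import warnings


def estimate_cores_per_node(num_execs_per_node, cores_per_executor):
    """Hypothetical helper: estimate cores per node, assuming no oversubscription.

    If the node is oversubscribed, the executors together claim more cores
    than the node physically has, so this value overstates the real count.
    """
    warnings.warn(
        "coresPerNode is estimated as numExecsPerNode * coresPerExecutor; "
        "the result is too high if the node is oversubscribed.",
        UserWarning,
    )
    return num_execs_per_node * cores_per_executor


# Values taken from the sample output in the PR description.
print(estimate_cores_per_node(6, 16))  # 96
```

Emitting a warning rather than raising keeps the estimate usable on platforms where oversubscription cannot be detected, while still making the assumption visible.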