NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] Review and update recommended Cluster info in metadata json file #1239

Closed tgravescs closed 3 weeks ago

tgravescs commented 1 month ago

PR https://github.com/NVIDIA/spark-rapids-tools/pull/1216 (came from https://github.com/NVIDIA/spark-rapids-tools/issues/1143) added a qualification_summary_metadata.json file. It has a recommendedCluster field that I think could be enhanced to be clearer to the user.

We need to talk through all the CSPs/cluster types and decide what we want that output to be.
I think it should include the platform in the metadata file as well. For instance, on Databricks the executor instance type might be enough and we don't need the number of GPUs, but we would want the number of worker nodes, not the number of executors. On-prem could be different since we don't know the executor instance types available and generally just use the number of executors. On GCP we may want to see the number of GPUs and the number of local SSDs to use.

### Tasks
- [ ] #1258
- [ ] #1265 
tgravescs commented 1 month ago

Here are a few thoughts on a couple of the CSPs to get discussions going:

Open question on naming: use "worker" or "executor"? Or does it change based on the CSP?

Note you might have multiple executors per node, but I think that info can live in the tuning files.
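
Below is a minimal sketch, in Python and purely illustrative (the helper name and the even-spread assumption are mine, not the tool's API), of how the executors-per-node detail could be derived for the tuning files when the metadata only records the number of worker nodes.

```python
# Hypothetical helper (not the tool's code): derive how many Spark executors
# run on each worker node when the metadata file only records cluster shape.
def executors_per_worker(num_worker_nodes: int, num_executors: int) -> int:
    """Assumes executors are spread evenly across the worker nodes."""
    if num_worker_nodes <= 0:
        raise ValueError("cluster must have at least one worker node")
    return max(1, num_executors // num_worker_nodes)

# e.g. 8 Spark executors on a 2-node cluster -> 4 executors per worker node
assert executors_per_worker(num_worker_nodes=2, num_executors=8) == 4
```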

Databricks:

      "platform": "databricks-aws",
      "driverNodeType": "g5.2xlarge",  
      "workerNodeType": "g5.8xlarge", 
      "numWorkerNodes": 2,
      "gpuInfo": {
        # here the gpu type is determined by the worker node type so we don't need it... but it might be useful to specify if we know it?
        "gpusPerWorker": 1
      }

Dataproc:

     "platform": "dataproc",
      "driverNodeType": "n1-standard-4",  
      "workerNodeType": "n1-standard-32", 
      "numWorkerNodes": 2,
      "gpuInfo": {
        "device": "nvidia-tesla-t4",
        "gpusPerWorker": 2
      },
      "additionalConfigs": {
        "localSsds": 2
      }

Dataproc serverless: TODO

Dataproc GKE: TODO

EMR: TODO

onPrem:

     "platform": "onprem",
      "numWorkerNodes": 2,  ? do we even want this?  There is a difference between standalone and yarn mode so this could still be useful. For yarn we also want to make sure node for the driver to go on
      "gpuInfo": {
        "gpusPerWorker": 2
      },
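
To make the per-platform differences above concrete, here is a rough Python sketch (the class and field names follow the proposal above but are otherwise assumptions, not the tool's actual code) where the platform-specific pieces are optional:

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class GpuInfo:
    gpusPerWorker: int
    device: Optional[str] = None        # e.g. "nvidia-tesla-t4"; may be implied by the worker node type

@dataclass
class RecommendedCluster:
    platform: str                       # e.g. "databricks-aws", "dataproc", "onprem"
    numWorkerNodes: int
    driverNodeType: Optional[str] = None    # unknown for onprem
    workerNodeType: Optional[str] = None    # unknown for onprem
    gpuInfo: Optional[GpuInfo] = None
    additionalConfigs: dict = field(default_factory=dict)  # e.g. {"localSsds": 2} on Dataproc

# Example: the Dataproc shape sketched above
dataproc = RecommendedCluster(
    platform="dataproc",
    numWorkerNodes=2,
    driverNodeType="n1-standard-4",
    workerNodeType="n1-standard-32",
    gpuInfo=GpuInfo(gpusPerWorker=2, device="nvidia-tesla-t4"),
    additionalConfigs={"localSsds": 2},
)
print(asdict(dataproc))
```
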
tgravescs commented 1 month ago

Another thing to revisit is the name of this file "qualification_summary_metadata.json". Perhaps app_metadata would be more obvious to the user.

Also, similar to the output above, we should figure out what we want the sourceCluster output to be. Generally I would expect it to match the recommendedCluster output.

parthosa commented 1 month ago

Thanks @tgravescs for the comments.

open questions on naming - use "worker" or "executor"? Or does it change based on csp?

  • Dataproc, DB-AWS and DB-Azure use the worker terminology while creating clusters. EMR uses core/task. We should probably use worker/workerNode/workerNodeType

Another thing to revisit is the name of this file "qualification_summary_metadata.json". Perhaps app_metadata would be more obvious to the user..

  • app_metadata.json makes sense

Note you might have multiple executors per node but I think that info can live in the tunings files.

  • Yes, the metadata file should contain entries related to cluster shape. If we distinguish 'workers' from 'executors', then all executor-related info could reside in the tuning files.
parthosa commented 1 month ago

Currently, here is the cluster info generated in the metadata json file (note: the source cluster uses a similar structure).

Dataproc

"recommendedCluster": {
  "driverInstance": "n1-standard-16",
  "executorInstance": "n1-standard-32",
  "numExecutors": 4,
  "gpuInfo": {
    "device": "nvidia-tesla-t4",
    "gpuPerWorker": 1
  },
  "additionalConfig": {
    "localSsd": 2
  }
}

Databricks AWS

"recommendedCluster": {
  "driverInstance": "m6gd.xlarge",
  "executorInstance": "g5.2xlarge",
  "numExecutors": 2
}

Databricks Azure

"recommendedCluster": {
  "driverInstance": "Standard_E8ds_v4",
  "executorInstance": "Standard_NC8as_T4_v3",
  "numExecutors": 2
}

EMR

"recommendedCluster": {
  "driverInstance": "i3.2xlarge",
  "executorInstance": "g5.4xlarge",
  "numExecutors": 16
}

OnPrem and Dataproc GKE (after #1241)

"recommendedCluster": {}

Comments

For instance, on Databricks the executor instance type might be enough and we don't need the number of GPUs, but we would want the number of worker nodes, not the number of executors.

  • For Databricks we are showing only the executor instance type.
  • numExecutors refers to the number of worker nodes (we should rename this if we decide to go with 'worker')

On GCP we may want to see the number of GPUs and the number of local SSDs to use.

  • We show the gpuPerWorker and localSsd to use (though the localSsd value is hard-coded to '2')

On-prem could be different since we don't know the executor instance types available and generally just use the number of executors.

  • We need to decide the cluster recommendation for OnPrem
tgravescs commented 1 month ago

Agree with the platform at the clusterInfo level.

I would lean towards workerNodeType, but we may also want to make this consistent. I think the rapids_4_spark_qualification_output_cluster_information.json uses numExecutorNodes. Can you check if it's referred to anywhere else in actual user-visible output? I'm not as concerned about it matching the terminology CSPs use as I am about it being consistent in our usage.

For on-prem, I think we make some assumptions about the number of GPUs per node and the number of nodes, so we should output what those are and fill in the recommendedCluster. We need to decide if we want the executorInstance to just not be there or to say something like "Not Applicable". It's sometimes nice to always have some output there.

parthosa commented 1 month ago

Can you check if it's referred to anywhere else in actual user-visible output?

  • I checked the output files. We refer to numExecutors only in the rapids_4_spark_qualification_output_cluster_information and the metadata file.
  • I think the following structure looks reasonable:

        "recommendedCluster": {
          "driverNodeType": "m6gd.xlarge",
          "workerNodeType": "g5.8xlarge",
          "numWorkerNodes": 2
        }
  • Now the 'executor' term will always refer to Spark executors.
  • For OnPrem, we can have the following structure:

        "recommendedCluster": {
          "driverNodeType": "Not Applicable",
          "workerNodeType": "Not Applicable",
          "numWorkerNodes": 2,
          "gpuInfo": {
            "device": "nvidia-tesla-t4",
            "gpuPerWorker": 1
          }
        }

    It's sometimes nice to always have some output there.

  • I agree. The fields driverNodeType, workerNodeType, and numWorkerNodes should be present for all platforms; see the sketch below.
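
A minimal sketch (hypothetical helper, not the tool's code) of emitting those common fields for every platform, falling back to "Not Applicable" when the instance types are unknown, as on onprem:

```python
NOT_APPLICABLE = "Not Applicable"

def build_recommended_cluster(num_worker_nodes: int,
                              driver_node_type: str = None,
                              worker_node_type: str = None,
                              gpu_info: dict = None) -> dict:
    """Always emit the common fields; add gpuInfo only when we have it."""
    cluster = {
        "driverNodeType": driver_node_type or NOT_APPLICABLE,
        "workerNodeType": worker_node_type or NOT_APPLICABLE,
        "numWorkerNodes": num_worker_nodes,
    }
    if gpu_info:
        cluster["gpuInfo"] = gpu_info
    return cluster

# onprem: instance types are unknown, but the GPU assumptions are still reported
print(build_recommended_cluster(2, gpu_info={"device": "nvidia-tesla-t4", "gpuPerWorker": 1}))
```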

Action Items:

@tgravescs Does the above list look good?

tgravescs commented 1 month ago

looks good, thanks!

parthosa commented 1 month ago

For the follow-up, I plan to improve the cluster recommendation for OnPrem and Dataproc as follows:

Part 1: Include 'calculated' gpu per worker info and 'hardcoded' SSD information in the metadata json file

File: qual_2024xxx/app_metadata.json

File Contents for `--platform dataproc`
```
{
  "appId": "app-20240311074805-0000",
  "appName": "test_app_xxxxx",
  "eventLog": "file:/path/to/log",
  "clusterInfo": {
    "platform": "dataproc",
    "sourceCluster": {
      "driverNodeType": "n1-standard-16",
      "workerNodeType": "n1-standard-8",
      "numWorkerNodes": 9
    },
    "recommendedCluster": {
      "driverNodeType": "n1-standard-16",
      "workerNodeType": "n1-standard-32",
      "numWorkerNodes": 9,
      "gpuInfo": {
        "device": "nvidia-tesla-t4",
        "gpuPerWorker": 4
      },
      "ssdInfo": {
        "numLocalSsds": 2
      }
    }
  },
  "estimatedGpuSpeedupCategory": "Medium",
  "fullClusterConfigRecommendations": "/tools-run/qual_20240805222947_F2b32E83/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.conf",
  "gpuConfigRecommendationBreakdown": "/tools-run/qual_20240805222947_F2b32E83/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.log"
},
```
File Contents for `--platform onprem`
```
{
  "appId": "app-20240311074805-0000",
  "appName": "test_app_xxxxx",
  "eventLog": "file:/path/to/log",
  "clusterInfo": {
    "platform": "onprem",
    "sourceCluster": {
      "driverNodeType": "N/A",
      "workerNodeType": "N/A",
      "numWorkerNodes": 9
    },
    "recommendedCluster": {
      "driverNodeType": "N/A",
      "workerNodeType": "N/A",
      "numWorkerNodes": 9,
      "gpuInfo": {
        "device": "L4",
        "gpuPerWorker": 1
      }
    }
  },
  "estimatedGpuSpeedupCategory": "Small",
  "fullClusterConfigRecommendations": "/tools-run/qual_20240805222616_FdaFabB8/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.conf",
  "gpuConfigRecommendationBreakdown": "/tools-run/qual_20240805222616_FdaFabB8/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.log"
},
```

Part 2: Improve Console Output to include Worker Count + GPU Info in the Qualified Cluster Recommendation column

Console Output for `--platform onprem`
```
+----+----------------------+-------------------------+-----------------+------------------------+------------------------------+-----------------------------+
|    | App Name             | App ID                  | Estimated GPU   | Qualified Cluster      | Full Cluster                 | GPU Config                  |
|    |                      |                         | Speedup         | Recommendation         | Config                       | Recommendation              |
|    |                      |                         | Category**      |                        | Recommendations*             | Breakdown*                  |
|----+----------------------+-------------------------+-----------------+------------------------+------------------------------+-----------------------------|
|  7 | test_app_xxxxxx      | app-20240312004315-0000 | Large           | 9 workers (with 1 L4)  | app-20240312004315-0000.conf | app-20240312004315-0000.log |
|  6 | test_app_xxxxxx      | app-20240312004328-0000 | Medium          | 50 workers (with 1 L4) | app-20240312004328-0000.conf | app-20240312004328-0000.log |
|  3 | test_app_xxxxxx      | app-20240311085743-0000 | Medium          | 2 workers (with 1 L4)  | app-20240311085743-0000.conf | app-20240311085743-0000.log |
|  4 | test_app_xxxxxx      | app-20240311074805-0000 | Small           | 9 workers (with 1 L4)  | app-20240311074805-0000.conf | app-20240311074805-0000.log |
|  0 | test_app_xxxxxx      | app-20240311181337-0000 | Small           | 2 workers (with 1 L4)  | app-20240311181337-0000.conf | app-20240311181337-0000.log |
+----+----------------------+-------------------------+-----------------+------------------------+------------------------------+-----------------------------+
```
Console Output for `--platform dataproc`
```
+----+----------------------+-------------------------+-----------------+---------------------------------+------------------------------+-----------------------------+
|    | App Name             | App ID                  | Estimated GPU   | Qualified Cluster               | Full Cluster                 | GPU Config                  |
|    |                      |                         | Speedup         | Recommendation                  | Config                       | Recommendation              |
|    |                      |                         | Category**      |                                 | Recommendations*             | Breakdown*                  |
|----+----------------------+-------------------------+-----------------+---------------------------------+------------------------------+-----------------------------|
|  3 | test_app_xxxxxx      | app-20240304225609-0000 | Large           | 15 x n1-standard-32 (with 4 T4) | app-20240304225609-0000.conf | app-20240304225609-0000.log |
|  2 | test_app_xxxxxx      | app-20240312151736-0000 | Large           | 2 x n1-standard-32 (with 4 T4)  | app-20240312151736-0000.conf | app-20240312151736-0000.log |
| 15 | test_app_xxxxxx      | app-20240312033441-0000 | Large           | 5 x n1-standard-64 (with 4 T4)  | app-20240312033441-0000.conf | app-20240312033441-0000.log |
|  0 | test_app_xxxxxx      | app-20240311181337-0000 | Large           | 2 x n1-standard-32 (with 4 T4)  | app-20240311181337-0000.conf | app-20240311181337-0000.log |
|  6 | test_app_xxxxxx      | app-20240312004328-0000 | Large           | 25 x n1-standard-32 (with 4 T4) | app-20240312004328-0000.conf | app-20240312004328-0000.log |
|  1 | test_app_xxxxxx      | app-20240311195738-0000 | Medium          | 2 x n1-standard-32 (with 4 T4)  | app-20240311195738-0000.conf | app-20240311195738-0000.log |
|  8 | test_app_xxxxxx      | app-20240312004315-0000 | Medium          | 9 x n1-standard-32 (with 4 T4)  | app-20240312004315-0000.conf | app-20240312004315-0000.log |
+----+----------------------+-------------------------+-----------------+---------------------------------+------------------------------+-----------------------------+
```
Console Output for `--platform databricks-aws`
```
+----+----------------------+-------------------------+-----------------+-------------------+------------------------------+-----------------------------+
|    | App Name             | App ID                  | Estimated GPU   | Qualified Cluster | Full Cluster                 | GPU Config                  |
|    |                      |                         | Speedup         | Recommendation    | Config                       | Recommendation              |
|    |                      |                         | Category**      |                   | Recommendations*             | Breakdown*                  |
|----+----------------------+-------------------------+-----------------+-------------------+------------------------------+-----------------------------|
|  0 | test_app_xxxxxx      | app-20240312151736-0000 | Medium          | 2 x g5.2xlarge    | app-20240312151736-0000.conf | app-20240312151736-0000.log |
|  8 | test_app_xxxxxx      | app-20240304225609-0000 | Medium          | 45 x g5.2xlarge   | app-20240304225609-0000.conf | app-20240304225609-0000.log |
|  5 | test_app_xxxxxx      | app-20240312004328-0000 | Small           | 150 x g5.2xlarge  | app-20240312004328-0000.conf | app-20240312004328-0000.log |
|  6 | test_app_xxxxxx      | app-20240312004315-0000 | Small           | 9 x g5.2xlarge    | app-20240312004315-0000.conf | app-20240312004315-0000.log |
|  7 | test_app_xxxxxx      | app-20240311122222-0000 | Small           | 2 x g5.2xlarge    | app-20240311122222-0000.conf | app-20240311122222-0000.log |
|  4 | test_app_xxxxxx      | app-20240311195738-0000 | Small           | 2 x g5.2xlarge    | app-20240311195738-0000.conf | app-20240311195738-0000.log |
+----+----------------------+-------------------------+-----------------+-------------------+------------------------------+-----------------------------+
```
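
As a footnote to Part 2, here is an illustrative Python sketch (the function and the device-name shortening are assumptions, not the tool's code) of how the "Qualified Cluster Recommendation" strings above could be composed from the app_metadata.json fields shown in Part 1:

```python
def format_recommendation(cluster: dict) -> str:
    """Build the console summary string from a recommendedCluster entry."""
    num_workers = cluster["numWorkerNodes"]
    worker_type = cluster.get("workerNodeType")
    gpu = cluster.get("gpuInfo")
    gpu_str = ""
    if gpu:
        # shorten e.g. "nvidia-tesla-t4" to "T4" for the console view
        device = gpu.get("device", "").split("-")[-1].upper()
        gpu_str = f" (with {gpu['gpuPerWorker']} {device})" if device else ""
    if worker_type and worker_type != "N/A":
        return f"{num_workers} x {worker_type}{gpu_str}"   # e.g. "9 x n1-standard-32 (with 4 T4)"
    return f"{num_workers} workers{gpu_str}"               # e.g. "9 workers (with 1 L4)"

print(format_recommendation({
    "workerNodeType": "n1-standard-32",
    "numWorkerNodes": 9,
    "gpuInfo": {"device": "nvidia-tesla-t4", "gpuPerWorker": 4},
}))  # -> 9 x n1-standard-32 (with 4 T4)
```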