NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS

User tools fallback to default zone/region #1054

Closed nartal1 closed 1 month ago

nartal1 commented 1 month ago

This fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1018.

This PR falls back to the default region/zone (where applicable) for CLI commands when the region/zone is not set by the user. Previously, the tool would throw an error that did not indicate the actual cause. With this change, it continues with the default region and logs a warning that the region was not set and that the default value from the environment variable is being used.

In addition, this PR updates the way the remaining environment variables are set in sp_types.py; the earlier condition would miss some of the environment variables.
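Conceptually, the fallback behaves like the sketch below. This is a minimal illustration only, not the actual code in sp_types.py; the names `DEFAULT_REGIONS` and `resolve_region` are hypothetical, and the real tool derives its defaults from the platform configuration and environment variables.

```python
import logging
from typing import Optional

# Hypothetical defaults for illustration; the real tool reads these from
# its platform configuration / environment variables, not a hard-coded table.
DEFAULT_REGIONS = {
    'dataproc': 'us-central1',
    'emr': 'us-east-1',
    'databricks-aws': 'us-west-2',
}

def resolve_region(platform: str, user_region: Optional[str]) -> str:
    """Return the user-supplied region, falling back to a default with a warning."""
    if user_region:
        return user_region
    default = DEFAULT_REGIONS[platform]
    logging.warning(
        'Region is not set; falling back to default region %s. '
        'Set the region explicitly to override this.', default)
    return default
```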

Tested on the following platforms:

  1. dataproc
  2. emr
  3. databricks-aws

databricks-azure already has a default defined.

**Dataproc failure**

```
spark_rapids qualification --eventlogs=gs://<PATH_TO_EVENTLOGS> --platform=dataproc
RuntimeError: Error invoking CMD :
        | ERROR: (gcloud.compute.machine-types.describe) Could not fetch resource:
        |  - Invalid value for field 'zone': 'None'. Must be a match of regex '[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?'
```
**Dataproc completion with this PR**
Initialization Scripts:
-----------------------
To create a GPU cluster, run the following script:

```bash
#!/bin/bash

export CLUSTER_NAME="default-cluster-name"

gcloud dataproc clusters create $CLUSTER_NAME \
    --image-version=2.1.41-debian11 \
    --region us-central1 \
    --zone us-central1-b \
    --master-machine-type n1-standard-16 \
    --num-workers 8 \
    --worker-machine-type n1-standard-64 \
    --num-worker-local-ssds 2 \
    --enable-component-gateway \
    --subnet=default \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/spark-rapids/spark-rapids.sh \
    --worker-accelerator type=nvidia-tesla-t4,count=2 \
    --properties 'spark:spark.driver.memory=50g'

```
Processing Completed!
**EMR failure**

```
spark_rapids qualification --eventlogs=s3://{PATH_TO_EVENTLOGS} --platform=emr --verbose
  File "/home/test/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py", line 333, in exec
    raise RuntimeError(f'{cmd_err_msg}')
RuntimeError: Error invoking CMD :
        |
        | Could not connect to the endpoint URL: "https://elasticmapreduce.None.amazonaws.com/"
```

(The `None` in the endpoint URL is the unset region being interpolated directly into the AWS endpoint.)
**EMR completion with this PR**

```
Instance types conversions:
------------  --  ----------
m6gd.4xlarge  to  g5.4xlarge
------------  --  ----------
To support acceleration with T4 GPUs, switch the worker node instance types

Processing Completed!
```
nartal1 commented 1 month ago

> Since this updates the region in our env_vars and not the actual CLI configuration, CLI cmds such as `aws emr describe-cluster --cluster-id {cluster_id}` might crash because it will try to get the region from the CLI config. We might have to add the region explicitly in these CLI cmds.

Yes @parthosa, you are correct. The region becomes mandatory when `--cluster` is provided as an argument. We now get a clearer error telling the user to set it explicitly:

```
Error invoking CMD <aws emr list-clusters --query 'Clusters[?Name==`emr_perfio_on_filecache_on_us_east_1a`]'>:
        |
        | You must specify a region. You can also configure your region by running "aws configure".
```
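One way to thread the region through (a sketch only; `run_emr_cmd` is a hypothetical helper, not part of the tool) is to append the AWS CLI's global `--region` flag to each constructed command, so the call no longer depends on the local CLI configuration:

```python
import subprocess

def run_emr_cmd(args: list, region: str) -> str:
    """Run an `aws emr` subcommand with the region passed explicitly,
    so the call does not rely on the local `aws configure` settings."""
    cmd = ['aws', 'emr', *args, '--region', region]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout

# Example (hypothetical cluster id): describe a cluster without a configured region.
# run_emr_cmd(['describe-cluster', '--cluster-id', 'j-XXXXXXXX'], region='us-east-1')
```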