databricks / databricks-sdk-py

Databricks SDK for Python (Beta)
https://databricks-sdk-py.readthedocs.io/
Apache License 2.0

[ISSUE] clusters.create_and_wait not accepting dict-input from configuration file. #690

Open tseader opened 1 week ago

tseader commented 1 week ago

Description I am attempting to create a compute cluster using the Python SDK while sourcing a cluster-create configuration JSON file, which is how it's done with the databricks-cli (e.g. databricks clusters create --json @my/path/to/cluster-create.json) and what Databricks provides through the GUI. Reading the JSON in as a dict fails because the SDK assumes the arguments are of specific dataclass types, e.g.:

>       if autoscale is not None: body['autoscale'] = autoscale.as_dict()
E       AttributeError: 'dict' object has no attribute 'as_dict'
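For a single nested field, the failure can be sidestepped by casting just that field before the call. A minimal sketch, assuming the AutoScale dataclass from databricks.sdk.service.compute:

from databricks.sdk.service.compute import AutoScale

# Cast only the nested 'autoscale' dict to its dataclass so the SDK's
# internal .as_dict() call succeeds; every other dataclass-typed field
# (e.g. gcp_attributes) would need the same treatment, which does not scale.
cluster_conf = {"autoscale": {"min_workers": 1, "max_workers": 1}}
cluster_conf["autoscale"] = AutoScale.from_dict(cluster_conf["autoscale"])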

Reproduction

Trimmed down cluster-create example JSON config:

{
    "cluster_name": "databricks-poc",
    "spark_version": "15.3.x-scala2.12",
    "spark_conf": {},
    "gcp_attributes": {
        "use_preemptible_executors": false,
        "availability": "ON_DEMAND_GCP",
        "zone_id": "auto"
    },
    "node_type_id": "e2-highmem-2",
    "spark_env_vars": {},
    "autotermination_minutes": 60,
    "enable_elastic_disk": true,
    "data_security_mode": "USER_ISOLATION",
    "runtime_engine": "STANDARD",
    "autoscale": {
        "min_workers": 1,
        "max_workers": 1
    }
}
import json

from databricks.sdk import WorkspaceClient

db_client = WorkspaceClient()
with open("my/path/to/cluster-create.json") as file:
    create_config = json.load(file)
db_client.clusters.create_and_wait(**create_config)

Expected behavior I expect that by passing in the dict of the cluster configuration, the SDK would handle the casting. Maybe not in this method, but perhaps in another method created to do something similar.
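A minimal sketch of what such a method could look like, assuming the CreateCluster dataclass from databricks.sdk.service.compute (create_from_dict itself is hypothetical and does not exist in the SDK):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import CreateCluster

def create_from_dict(db_client: WorkspaceClient, config: dict):
    # Cast the raw dict to the CreateCluster dataclass, then unpack its
    # fields back into the keyword arguments create_and_wait expects.
    spec = CreateCluster.from_dict(config)
    return db_client.clusters.create_and_wait(
        **{f: getattr(spec, f) for f in spec.__dataclass_fields__})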

Is it a regression? No

Debug Logs N/A, I don't think they're needed here.

Additional context I can work around this by implementing my own custom solution that casts to the appropriate data classes, but I'm hoping maybe I'm just missing the pattern, or that this pattern is helpful for more than just me.

tseader commented 1 week ago

Here's what I came up with to get around the situation:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import CreateCluster

def create_compute_cluster(db_client: WorkspaceClient, cluster_conf: dict) -> str:
    # Cast the dict to the CreateCluster dataclass, then unpack its
    # fields back into keyword arguments.
    cc = CreateCluster.from_dict(cluster_conf)
    refactored_input = dict()
    for field in cc.__dataclass_fields__:
        refactored_input[field] = getattr(cc, field)
    # CLUSTER_UP_TIMEOUT is a timeout constant defined elsewhere.
    return db_client.clusters.create_and_wait(**refactored_input, timeout=CLUSTER_UP_TIMEOUT)

I could also see the function reading the JSON file directly, more like this:

def create_compute_cluster(db_client: WorkspaceClient, create_config_path: str) -> str:
    with open(create_config_path) as file:
        create_config = json.load(file)
    cc = CreateCluster.from_dict(create_config)
    refactored_input = dict()
    for field in cc.__dataclass_fields__:
        refactored_input[field] = getattr(cc, field)
    return db_client.clusters.create_and_wait(**refactored_input, timeout=CLUSTER_UP_TIMEOUT)

What may make sense is adding some functions to the ClustersAPI class, unless overloading via multipledispatch is preferred. All this assumes there's a need beyond my own for this type of pattern. 🤷
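For illustration, here is a rough sketch of the multipledispatch flavor; the create_cluster overloads are hypothetical and nothing like them exists in the SDK today:

from multipledispatch import dispatch
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import CreateCluster

# Hypothetical overloads: dispatch on the type of the cluster spec.
@dispatch(WorkspaceClient, CreateCluster)
def create_cluster(db_client, spec):
    # Dataclass input: unpack its fields into keyword arguments.
    return db_client.clusters.create_and_wait(
        **{f: getattr(spec, f) for f in spec.__dataclass_fields__})

@dispatch(WorkspaceClient, dict)
def create_cluster(db_client, spec):
    # Dict input: cast to the dataclass, then reuse the overload above.
    return create_cluster(db_client, CreateCluster.from_dict(spec))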