databricks / databricks-sdk-py

Databricks SDK for Python (Beta)
https://databricks-sdk-py.readthedocs.io/
Apache License 2.0

[ISSUE] `CreateCluster` is missing `data_security_mode` attribute #225

Closed judahrand closed 1 year ago

judahrand commented 1 year ago

Description We have policies in place which require data_security_mode to be set when creating a Cluster. Because this attribute is missing we cannot create clusters with the SDK.

Expected behavior One should be able to set data_security_mode when calling ClustersAPI.create.

Debug Logs The SDK logs helpful debugging information when debug logging is enabled. Set the log level to debug by adding logging.basicConfig(level=logging.DEBUG) to your program, and include the logs here.


judahrand commented 1 year ago

@mgyucht could you please look at this issue? Does this mean that this attribute is missing from the OpenAPI spec? It is definitely accepted and used by the actual endpoint.

narquette commented 1 year ago

Agreed that this should be added to the class (ClusterCreate) and the related method (clusters.create). Since the edit method (clusters.edit) already accepts data_security_mode as an argument, my workaround is to call create, save the response in a variable, and then follow up with an edit using the returned cluster ID and the same arguments as the create. It is not ideal, but it works.

Example:


```python
import time

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import (AutoScale, AwsAttributes, AwsAvailability,
                                            ClusterSource, DataSecurityMode, RuntimeEngine)

w = WorkspaceClient(profile='DEFAULT')

cluster_policies = [pol for pol in w.cluster_policies.list() if pol.name == 'HIPAA_intelli_curvgh']
cluster_policy = cluster_policies[0]
spark_version = '13.2.x-cpu-ml-scala2.12'

cluster_info = {
    'spark_version': spark_version,
    'autoscale': AutoScale(min_workers=2, max_workers=8),
    'autotermination_minutes': 30,
    'aws_attributes': AwsAttributes(
        availability=AwsAvailability('SPOT_WITH_FALLBACK'),
        ebs_volume_count=0,
        first_on_demand=1,
        instance_profile_arn='<add_arn_role>',
        spot_bid_price_percent=100,
        zone_id='auto'
    ),
    'cluster_name': 'Nick Cluster Copy',
    'cluster_source': ClusterSource('API'),
    'data_security_mode': DataSecurityMode('SINGLE_USER'),
    'driver_node_type_id': 'i3en.2xlarge',
    'enable_elastic_disk': True,
    'enable_local_disk_encryption': False,
    'enable_unity_catalog': True,
    'node_type_id': 'i3en.2xlarge',
    'policy_id': cluster_policy.policy_id,
    'runtime_engine': RuntimeEngine('STANDARD'),
    'single_user_name': '<add_your_user_name>',
    'spark_conf': {'spark.databricks.service.port': '8787',
                   'spark.databricks.service.server.enabled': 'true'},
    'spark_env_vars': None,
    'ssh_public_keys': None
}

# `create` does not yet accept `data_security_mode` (the bug in this issue),
# so drop it from the arguments for the create call
create_args = {k: v for k, v in cluster_info.items() if k != 'data_security_mode'}
resp = w.clusters.create(**create_args)

# wait until the cluster has left the PENDING state
while w.clusters.get(resp.response.cluster_id).state.name == 'PENDING':
    time.sleep(60)

# re-apply the same arguments via `edit`, which does accept `data_security_mode`
w.clusters.edit(cluster_id=resp.response.cluster_id, **cluster_info)
```

nfx commented 1 year ago

@narquette `w.clusters.create(..).get()` should wait until the cluster is properly running or fail. Please update your code.

judahrand commented 1 year ago

> @narquette `w.clusters.create(..).get()` should wait until the cluster is properly running or fail. Please update your code.

This is true, but it's also kind of weird behaviour from Databricks, IMO. It isn't clear that creating a cluster should also start it.

Once #227 is merged, I'd argue that the obvious (though, you're correct, unnecessary) code would be:

```python
resp = w.clusters.create(**cluster_info)
w.clusters.ensure_cluster_is_running(resp.response.cluster_id)
```

But Databricks has a lot of unintuitive behaviour 🤷

nfx commented 1 year ago

> This is true but is also kind of weird behaviour from Databricks imo. It isn't clear that creating a cluster should also start it.

@judahrand Yes, it starts the cluster. We'll need to make that clear in the documentation.

> But Databricks has a lot of unintuitive behaviour 🤷

SDK docs will get improved over time. Please keep an eye on them :)

judahrand commented 1 year ago

More importantly, is this issue likely to be fixed any time soon? It isn't one that the community can help with since the OpenAPI spec isn't publicly available (I'm still somewhat unclear as to why).

mgyucht commented 1 year ago

Hi @judahrand, sorry I missed your tag. In the meantime, this field was added to the OpenAPI spec. It is included in the latest release of the SDK: https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/service/compute.py#L4090.
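A quick way to confirm that the SDK version you've installed actually supports the field is to inspect the method signature. A hedged sketch using a stand-in function (with the real SDK installed, you would run `inspect.signature(w.clusters.create)` instead; the stand-in's signature is illustrative only):

```python
import inspect

# Stand-in for ClustersAPI.create, for illustration only; with the real
# SDK you would inspect w.clusters.create itself.
def create(spark_version: str, *, cluster_name=None, data_security_mode=None, **kwargs):
    ...

params = inspect.signature(create).parameters
print('data_security_mode' in params)  # → True on SDK releases that include the field
```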

As for the OpenAPI spec, we will eventually make the spec public but have not prioritized it yet. We understand that your ability to contribute to the SDK is very limited without the spec. For now, we've primarily focused on improving the SDK development cycle for internal contributors, but over time we expect that others will be able to contribute. Thank you for your understanding.