Nike-Inc / brickflow

Pythonic Programming Framework to orchestrate jobs in Databricks Workflow
https://engineering.nike.com/brickflow/
Apache License 2.0

[BUG] Cannot set data_security_mode to LEGACY_SINGLE_USER_STANDARD #77

Closed sidharth-shridhar closed 9 months ago

sidharth-shridhar commented 9 months ago

Describe the bug

Unable to set the data_security_mode property to LEGACY_SINGLE_USER_STANDARD when creating a new cluster. The following error is raised:

pydantic.error_wrappers.ValidationError: 1 validation error for JobsJobClusters
new_cluster -> data_security_mode
  unexpected value; permitted: 'SINGLE_USER', 'USER_ISOLATION', 'NONE' (type=value_error.const; given=LEGACY_SINGLE_USER_STANDARD; permitted=('SINGLE_USER', 'USER_ISOLATION', 'NONE'))

This setting is required for legacy workspaces; without it, datasets cannot be saved to hive_metastore.
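The error above is what a field restricted to a fixed set of string values produces when it is given anything else. A minimal stand-alone sketch of that kind of check follows. It uses only the standard library instead of pydantic, and the names `DataSecurityMode` and `check_mode` are hypothetical, not brickflow's own:

```python
from typing import Literal, get_args

# Hypothetical stand-in for the restricted field type. The three values are
# the ones listed as "permitted" in the error message above.
DataSecurityMode = Literal["SINGLE_USER", "USER_ISOLATION", "NONE"]


def check_mode(value: str) -> str:
    """Return value if it is one of the permitted modes, else raise ValueError."""
    permitted = get_args(DataSecurityMode)
    if value not in permitted:
        raise ValueError(
            f"unexpected value; permitted: {permitted}; given={value}"
        )
    return value


# A permitted value passes through unchanged.
print(check_mode("SINGLE_USER"))

# The value from this report is rejected, mirroring the reported failure.
try:
    check_mode("LEGACY_SINGLE_USER_STANDARD")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Any value outside the literal set fails before deployment is attempted, which is why the failure shows up at `bf projects synth` time rather than in the workspace.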

To Reproduce

Steps to reproduce the behavior:

  1. Create a standard workflow using brickflow

  2. Define a job_cluster as below:

    job_cluster = Cluster(
        name="job_cluster",
        node_type_id="r6gd.4xlarge",
        spark_version="12.2.x-scala2.12",
        min_workers=1,
        max_workers=8,
        data_security_mode="LEGACY_SINGLE_USER_STANDARD",
        policy_id="ADD_UR_POLICY_ID",
    )

  3. Try to deploy the workflow to the desired Databricks workspace: bf projects synth --project <project_name>

  4. See error:

    pydantic.error_wrappers.ValidationError: 1 validation error for JobsJobClusters
    new_cluster -> data_security_mode
      unexpected value; permitted: 'SINGLE_USER', 'USER_ISOLATION', 'NONE' (type=value_error.const; given=LEGACY_SINGLE_USER_STANDARD; permitted=('SINGLE_USER', 'USER_ISOLATION', 'NONE'))

Expected behavior

One should be able to deploy workflows whose cluster has data_security_mode=LEGACY_SINGLE_USER_STANDARD without any issues.

stikkireddy commented 9 months ago

@sidharth-shridhar are you referring to these (https://docs.databricks.com/api/workspace/clusters/create):

data_security_mode string

Enum: "NONE" "SINGLE_USER" "USER_ISOLATION" "LEGACY_TABLE_ACL" "LEGACY_PASSTHROUGH" "LEGACY_SINGLE_USER"

Data security mode decides what data governance model to use when accessing data from a cluster.

NONE: No security isolation for multiple users sharing the cluster. Data governance features are not available in this mode.

SINGLE_USER: A secure cluster that can only be exclusively used by a single user specified in single_user_name. Most programming languages, cluster features and data governance features are available in this mode.

USER_ISOLATION: A secure cluster that can be shared by multiple users. Cluster users are fully isolated so that they cannot see each other's data and credentials. Most data governance features are supported in this mode. But programming languages and cluster features might be limited.

LEGACY_TABLE_ACL: This mode is for users migrating from legacy Table ACL clusters.

LEGACY_PASSTHROUGH: This mode is for users migrating from legacy Passthrough on high concurrency clusters.

LEGACY_SINGLE_USER: This mode is for users migrating from legacy Passthrough on standard clusters.
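For comparison, the quoted enum can be written down as a small Python `Enum` and checked against the value from the report. This is only a sketch of the excerpt quoted above: the class name `DataSecurityMode` and the helper `is_listed` are hypothetical, and the result says nothing about what the Databricks API itself accepts beyond this excerpt.

```python
from enum import Enum


class DataSecurityMode(str, Enum):
    """Values copied from the enum quoted above (hypothetical class name)."""

    NONE = "NONE"
    SINGLE_USER = "SINGLE_USER"
    USER_ISOLATION = "USER_ISOLATION"
    LEGACY_TABLE_ACL = "LEGACY_TABLE_ACL"
    LEGACY_PASSTHROUGH = "LEGACY_PASSTHROUGH"
    LEGACY_SINGLE_USER = "LEGACY_SINGLE_USER"


def is_listed(value: str) -> bool:
    """Return True if value appears in the quoted enum."""
    return value in {mode.value for mode in DataSecurityMode}


print(is_listed("LEGACY_SINGLE_USER"))           # True
print(is_listed("LEGACY_SINGLE_USER_STANDARD"))  # False
```

The quoted list contains LEGACY_SINGLE_USER but not LEGACY_SINGLE_USER_STANDARD, so the two names should not be assumed to be interchangeable.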

sidharth-shridhar commented 9 months ago

@stikkireddy Yes, indeed. For certain workspaces that are not Unity Catalog enabled, the job cluster's data_security_mode should be set to "LEGACY_SINGLE_USER_STANDARD" in order to save datasets to hive_metastore. ref: issue

Currently, https://github.com/Nike-Inc/brickflow/blob/main/brickflow/bundles/model.py has a type check that permits only 'SINGLE_USER', 'USER_ISOLATION', and 'NONE'.

Can we add the others, including LEGACY_SINGLE_USER_STANDARD?
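A rough sketch of what the widened check could look like, assuming the permitted set is simply extended with the legacy values mentioned in this thread. This is not the actual contents of model.py: `NewClusterSketch` and `DataSecurityMode` are hypothetical names, and a plain dataclass stands in for the real pydantic model.

```python
from dataclasses import dataclass
from typing import Literal, Optional, get_args

# Assumption: the three currently permitted values plus the legacy values
# named in this thread. The real list should follow the Databricks API.
DataSecurityMode = Literal[
    "SINGLE_USER",
    "USER_ISOLATION",
    "NONE",
    "LEGACY_TABLE_ACL",
    "LEGACY_PASSTHROUGH",
    "LEGACY_SINGLE_USER",
    "LEGACY_SINGLE_USER_STANDARD",
]


@dataclass
class NewClusterSketch:
    """Hypothetical stand-in for the cluster model, not brickflow's class."""

    spark_version: str
    data_security_mode: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject anything outside the widened set; None means "not set".
        permitted = get_args(DataSecurityMode)
        mode = self.data_security_mode
        if mode is not None and mode not in permitted:
            raise ValueError(f"unexpected value; permitted: {permitted}")


cluster = NewClusterSketch(
    spark_version="12.2.x-scala2.12",
    data_security_mode="LEGACY_SINGLE_USER_STANDARD",
)
print(cluster.data_security_mode)
```

With the wider set, the value from the reproduction steps is accepted, while misspelled or unknown modes are still rejected.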