databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
https://databrickslabs.github.io/dbldatagen

Distribution functions (and perhaps others) not compatible with Databricks UC Clusters operating in `shared` mode #154

Closed kthejoker closed 1 year ago

kthejoker commented 1 year ago

Expected Behavior

  1. Set up UC-enabled Databricks interactive cluster.
  2. Use dbldatagen to create a data spec that uses distribution (`dist`) functions for data generation.

Expected result: a DataFrame containing one million ints.

Current Behavior

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] UDF/UDAF functions are not supported in Unity Catalog.; Project [id#835L, cast(((round((gamma_func(1.0, 2.0, 2678957030506407010)#837 cast(99 as float)), 0) cast(1 as float)) + cast(1 as float)) as int) AS ip_address#838] +- Range (0, 1000000, step=1, splits=Some(256))

Steps to Reproduce (for bugs)

  1. Set up a UC cluster in `shared` access mode.
  2. Run the following in a notebook attached to that cluster:

from dbldatagen import DataGenerator
import dbldatagen.distributions as dist

shuffle_partitions_requested = 256
partitions_requested = 256
data_rows = 1 * 1000000  # 1 million rows

# partition parameters etc.
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

dfDataspec = (
    DataGenerator(spark, rows=data_rows, partitions=partitions_requested)
    .withColumn("ip_address", "int", minValue=1, maxValue=100, random=True,
                distribution=dist.Gamma(1.0, 2.0))
)
df = dfDataspec.build()

Context

Your Environment

kthejoker commented 1 year ago

Mostly leaving this issue here so we can test dbldatagen with UC clusters as soon as Python UDFs are supported there.

ronanstokes-db commented 1 year ago

There are three cluster access modes available in UC-enabled environments: "single user", "shared", and "no isolation shared".

The data generator works as expected on "single user" and "no isolation shared" clusters. Only "shared" clusters reject the UDF-based distribution functions.

See the following resource for a description of the cluster access modes:

https://docs.databricks.com/clusters/cluster-ui-preview.html
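
For anyone hitting this in a notebook, here is a minimal sketch of how the failure could be turned into a more helpful message. It is not part of dbldatagen: `build_with_hint` and the `UDF_DISTRIBUTIONS_SUPPORTED` table are hypothetical names made up for illustration, and the support matrix just restates what is said in this thread. The error is matched on its message text, so the sketch does not depend on any particular pyspark exception class.

```python
# Support matrix as described in this thread (not queried from the cluster).
UDF_DISTRIBUTIONS_SUPPORTED = {
    "single user": True,
    "shared": False,
    "no isolation shared": True,
}

# Error class that appears in the AnalysisException reported above.
UC_UDF_ERROR_TAG = "UC_COMMAND_NOT_SUPPORTED"


def build_with_hint(build_fn):
    """Call build_fn() and add an access-mode hint to the UC UDF error.

    build_fn is a hypothetical zero-argument callable supplied by the
    caller, for example: lambda: dfDataspec.build()
    """
    try:
        return build_fn()
    except Exception as exc:
        # Match on the message so no pyspark import is needed here.
        if UC_UDF_ERROR_TAG in str(exc):
            working = sorted(
                mode for mode, ok in UDF_DISTRIBUTIONS_SUPPORTED.items() if ok
            )
            raise RuntimeError(
                "Distribution functions rely on UDFs, which this cluster "
                "rejected. Access modes reported to work: "
                + ", ".join(working)
            ) from exc
        # Any other failure is passed through unchanged.
        raise
```

Usage would look like `df = build_with_hint(lambda: dfDataspec.build())`. On a "single user" or "no isolation shared" cluster the wrapper just returns the DataFrame.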

ronanstokes-db commented 1 year ago

This will not require any code changes, but the limitation will be noted in the README.

ronanstokes-db commented 1 year ago

Readme updated