databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
https://databrickslabs.github.io/dbldatagen

Random number generation not generating random data unless `maxValue` is specified or is implied from other options #259

Open ronanstokes-db opened 7 months ago

ronanstokes-db commented 7 months ago

Expected Behavior

Specifying the option random=True for a column should be sufficient to generate random values for that column.

Current Behavior

This currently works only if an upper bound (i.e. a maximum value) is specified for the column. An upper bound is also computed implicitly when the values option or the uniqueValues option is used.

The workaround in the current release is to always specify an upper bound, using either the maxValue option, the uniqueValues option, or another option such as values that implicitly computes an upper bound for the range of values produced.
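Until this is fixed, it may help to check column definitions for this condition before building a data set. The following is a minimal sketch in plain Python and is not part of dbldatagen: the helper name `random_columns_without_upper_bound`, the `BOUNDING_OPTIONS` tuple, and the dictionary layout are hypothetical, chosen only to illustrate the rule described above. The list of bounding options covers only those named in this issue and may be incomplete.

```python
# Hypothetical pre-flight check; not part of dbldatagen.
# Options that, per this issue, give a column an explicit or implied upper bound.
# Assumption: this list may be incomplete, since other options can also imply a bound.
BOUNDING_OPTIONS = ("maxValue", "uniqueValues", "values")


def random_columns_without_upper_bound(column_specs, bounding_options=BOUNDING_OPTIONS):
    """Return the names of columns that set random=True but no bounding option.

    column_specs maps a column name to the keyword options that would be
    passed to withColumn for that column.
    """
    flagged = []
    for name, options in column_specs.items():
        is_random = options.get("random", False)
        has_bound = any(options.get(opt) is not None for opt in bounding_options)
        if is_random and not has_bound:
            flagged.append(name)
    return flagged


# Option sets mirroring some of the columns in the reproduction case.
specs = {
    "customer_id": {"minValue": 100, "maxValue": 2147483647, "random": True},
    "customer_id2": {"random": True},
    "code3": {"values": ["online", "offline", "unknown"], "random": True},
    "code7": {"uniqueValues": 50, "random": True},
}
print(random_columns_without_upper_bound(specs))  # ['customer_id2']
```

Any column the check flags would need an explicit maxValue (or one of the other bounding options) added to its definition for the workaround to apply.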

Steps to Reproduce (for bugs)

The following code correctly generates random data for all columns marked as random except customer_id2:

import dbldatagen as dg
from pyspark.sql.types import IntegerType, StringType

# Assumes an active SparkSession is available as `spark`.
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=10000, partitions=4)
    .withIdOutput()
    .withColumn("customer_id", "long", minValue=100, maxValue=2147483647, random=True)
    # No upper bound specified or implied: this column is not randomized.
    .withColumn("customer_id2", "long", random=True)
    .withColumn("code1", IntegerType(), minValue=100, maxValue=200, random=True)
    .withColumn("code2", "integer", minValue=0, maxValue=10, random=True)
    # The values option implies an upper bound.
    .withColumn("code3", StringType(), values=["online", "offline", "unknown"], random=True)
    .withColumn(
        "code4", StringType(), values=["a", "b", "c"], random=True, percentNulls=0.05
    )
    .withColumn(
        "code5", "string", values=["a", "b", "c"], random=True, weights=[9, 1, 1]
    )
    .withColumn("code6", "integer", maxValue=10, random=True)
    # The uniqueValues option implies an upper bound.
    .withColumn("code7", "integer", uniqueValues=50, random=True)
)

Context

Your Environment