databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
https://databrickslabs.github.io/dbldatagen

ArrayType(StringType()) columns result in Null column, doesn't take values from `values` argument #209

Closed zyd14 closed 1 year ago

zyd14 commented 1 year ago

Expected Behavior

With version 0.2.1 I used to be able to make a column containing string arrays like this:

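import dbldatagen as dg
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# `spark` is assumed to be an existing SparkSession (e.g. in a Databricks notebook)
df_spec = dg.DataGenerator(spark, name="test-data", rows=2)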
df_spec = df_spec.withColumn(
    "test",
    ArrayType(StringType()),
    values=[
        F.array(F.lit("A")),
        F.array(F.lit("C")),
        F.array(F.lit("T")),
        F.array(F.lit("G")),
    ],
)
test_df = df_spec.build()

This returned a DataFrame with a column named "test" whose 2 rows held values picked from [["A"], ["C"], ["T"], ["G"]].

Current Behavior

When I execute the same code with version 0.3.1+ (tried 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.4.post1), the resulting "test" column only contains None values.
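To make the difference between the two behaviors easy to check, here is a minimal sketch of a helper that summarizes a collected array column. It is plain Python and not part of dbldatagen; the name `summarize_array_column` and the sample lists are hypothetical, made up to mirror the report rather than copied from real output. With a real DataFrame, the input could be obtained with something like `[row["test"] for row in test_df.collect()]`.

```python
def summarize_array_column(cells, allowed):
    """Summarize a collected array column.

    cells: list of row values, each either None or a list of strings.
    allowed: the set of strings that array elements may contain.
    """
    nulls = sum(1 for c in cells if c is None)
    # Non-null cells that contain at least one element outside `allowed`
    unexpected = [
        c for c in cells
        if c is not None and any(v not in allowed for v in c)
    ]
    return {"rows": len(cells), "nulls": nulls, "unexpected": unexpected}


allowed = {"A", "C", "T", "G"}

# Hypothetical data shaped like the expected (0.2.1) result
expected_like = [["A"], ["G"]]
# Hypothetical data shaped like the reported (0.3.1+) result
buggy_like = [None, None]

expected_summary = summarize_array_column(expected_like, allowed)
buggy_summary = summarize_array_column(buggy_like, allowed)
```

A non-zero `nulls` count for the "test" column would correspond to the behavior described above.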

Steps to Reproduce (for bugs)

Run the sample code provided in Expected Behavior

Your Environment

ronanstokes-db commented 1 year ago

Thanks for reporting this. There is a workaround for now:

df_spec = dg.DataGenerator(spark, name="test-data", rows=2)
df_spec = df_spec.withColumn(
    "test",
    StringType(),  # can also use "string"
    values=["A", "C", "T", "G"],
    numFeatures=(1, 3),
    structType="array"
)
test_df = df_spec.build()
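As a rough illustration of the output shape this workaround asks for, the sketch below models it in plain Python. It assumes that `numFeatures=(1, 3)` combined with `structType="array"` means each row gets an array of between 1 and 3 elements drawn from `values`; that reading is an assumption here, and the function `sample_string_arrays` is hypothetical, not dbldatagen's implementation or its sampling logic.

```python
import random


def sample_string_arrays(values, min_len, max_len, rows, seed=0):
    """Model of the assumed output: `rows` arrays, each with
    min_len..max_len elements chosen from `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    result = []
    for _ in range(rows):
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        result.append([rng.choice(values) for _ in range(length)])
    return result


values = ["A", "C", "T", "G"]
sampled = sample_string_arrays(values, min_len=1, max_len=3, rows=2)
```

Under that assumption, the workaround can yield arrays with more than one element, whereas the original code in the issue always produced single-element arrays.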

zyd14 commented 1 year ago

great, thanks for the workaround and quick response!