ArrayType(StringType()) columns result in Null column, doesn't take values from `values` argument

databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines


Expected Behavior

With version 0.2.1 I used to be able to make a column containing string arrays like this:

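import dbldatagen as dg
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# `spark` is an existing SparkSession
df_spec = dg.DataGenerator(spark, name="test-data", rows=2)
df_spec = df_spec.withColumn(
    "test",
    ArrayType(StringType()),
    # each candidate value is a one-element string array
    values=[
        F.array(F.lit("A")),
        F.array(F.lit("C")),
        F.array(F.lit("T")),
        F.array(F.lit("G")),
    ],
)
test_df = df_spec.build()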

I would then receive a DataFrame with a column named "test" whose 2 rows hold values picked from [["A"], ["C"], ["T"], ["G"]].

Current Behavior

When I execute the same code with version 0.3.1+ (tried 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.4.post1), the resulting "test" column only contains None values.

Steps to Reproduce (for bugs)

Run the sample code provided in Expected Behavior
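
To confirm the result without eyeballing the output, the collected rows can be checked for nulls. This is a minimal sketch: `null_fraction` is a hypothetical helper (not part of dbldatagen), and it assumes rows that support lookup by column name, as the `Row` objects returned by `test_df.collect()` do. Plain dicts stand in for them here so it can be tried without Spark.

```python
def null_fraction(rows, column):
    """Return the fraction of rows whose `column` value is None."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[column] is None) / len(rows)


# Against the reproduction this would be called as:
#   null_fraction(test_df.collect(), "test")
# Stand-in rows (assumed shapes, not real output):
affected = [{"test": None}, {"test": None}]    # what 0.3.1+ is reported to produce
expected = [{"test": ["A"]}, {"test": ["G"]}]  # what 0.2.1 is reported to produce

print(null_fraction(affected, "test"))
print(null_fraction(expected, "test"))
```

A value of 1.0 for the "test" column would match the behavior described above, and 0.0 would match the 0.2.1 behavior.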

Your Environment

dbldatagen version used: 0.3.1-0.3.4
Databricks Runtime version: N/A, testing locally
Cloud environment used: N/A testing locally
python 3.8.10
mac arm64

df_spec = dg.DataGenerator(spark, name="test-data", rows=2)
df_spec = df_spec.withColumn(
    "test",
    StringType(),  # can also use "string"
    values=["A", "C", "T", "G"],
    numFeatures=(1, 3),
    structType="array"
)
test_df = df_spec.build()
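
Note that the shape differs from the original report: as far as I understand it, `numFeatures=(1, 3)` with `structType="array"` should give arrays of between one and three elements drawn from `values`, not strictly one-element arrays. A pure-Python sketch of that shape follows. `sample_array_column` is a hypothetical stand-in written for illustration, not dbldatagen's implementation, and the independent per-element choice is an assumption.

```python
import random


def sample_array_column(values, rows, min_len=1, max_len=3, seed=0):
    """Build `rows` lists, each holding min_len..max_len items picked from `values`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    return [
        [rng.choice(values) for _ in range(rng.randint(min_len, max_len))]
        for _ in range(rows)
    ]


sample = sample_array_column(["A", "C", "T", "G"], rows=2)
print(sample)  # two lists, each with one to three letters
```

If strictly one-element arrays are needed, the generated column would still have to be checked against that requirement.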
