databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
https://databrickslabs.github.io/dbldatagen

Generating text with baseColumn not consistent #103

Closed ericfeunekes closed 1 year ago

ericfeunekes commented 2 years ago

Expected Behavior

When generating any data, if baseColumn is set to a reference column in withColumn then the data generated for the new column should be the same when the value of the reference column is the same.

For example, for the following code:

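# `spark` and `display` are provided by the Databricks notebook environment
from pyspark.sql.types import IntegerType
from dbldatagen import DataGenerator, fakerText

rows = 10
partitions = 1

unique_customers = 2

generator = (DataGenerator(spark, name="demo", rows=rows, partitions=partitions, randomSeedMethod='hash_fieldname')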
 .withIdOutput()
 .withColumn("customer_id", IntegerType(), uniqueValues=unique_customers, baseColumnType="hash")
 .withColumn("first_name", text=fakerText("first_name"), base_column="customer_id")
 .withColumn("phone", template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd', base_column="customer_id")
)

df = generator.build()
display(df)

I would expect only two distinct first names, each consistently paired with one value of customer_id.

Current Behavior

With the example above, I get any number of random values in first_name and phone.

Steps to Reproduce (for bugs)

See the code above

Context

Trying to generate data with consistent values within a row.

Your Environment

ronanstokes-db commented 2 years ago

Thanks for your feedback

If the baseColumnType is hash, then the generated column for customer_id will compute a hash of the base column (by default, the column named id).

Use of the unique_values option then applies a modulo operation to restrict the generated values to the number of possible unique values indicated.

I'll review the generated data to see if the customer_ids are consistent with the above.
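
To make the hash-then-modulo idea in this comment concrete, here is a minimal plain-Python sketch. It is an illustration only: the function name `hashed_bucket` is hypothetical, and SHA-256 stands in for whatever hash the library applies inside Spark, so the actual values will differ from what dbldatagen produces.

```python
import hashlib


def hashed_bucket(base_value: int, unique_values: int) -> int:
    """Hash the base value, then reduce it to one of `unique_values` buckets.

    Hypothetical helper: SHA-256 is an assumption used for illustration,
    not the hash function used by dbldatagen.
    """
    digest = hashlib.sha256(str(base_value).encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:8], "big")  # take 64 bits of the digest
    return hashed % unique_values  # modulo limits the number of distinct values


# Ten base ids (like the default `id` column) mapped to two possible customer ids
customer_ids = [hashed_bucket(i, 2) for i in range(10)]
print(customer_ids)
```

Because the result depends only on the base value, the same id always maps to the same customer_id, which is the property the rest of this thread relies on.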

ericfeunekes commented 2 years ago

I think that makes sense. But my issue is more for the first_name column. It uses base_column=customer_id, therefore I would expect that for each unique customer_id there should be one and only one first_name.

But that wasn't the case: even though there are only two unique customer_id values, first_name was randomly generated in every row.

It's the same thing for phone.
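
The expectation described here (one and only one first_name per customer_id) can be checked mechanically. The following is a small sketch in plain Python; the helper name `inconsistent_keys` and the sample rows are made up for illustration and are not output from the generator.

```python
from collections import defaultdict


def inconsistent_keys(rows, key, value):
    """Return the keys that map to more than one distinct value.

    Hypothetical helper for illustration; `rows` is any iterable of dicts.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[value])
    return {k: v for k, v in seen.items() if len(v) > 1}


# Made-up sample rows: customer 1 has two different names, customer 0 has one
sample_rows = [
    {"customer_id": 0, "first_name": "Ann"},
    {"customer_id": 1, "first_name": "Bob"},
    {"customer_id": 0, "first_name": "Ann"},
    {"customer_id": 1, "first_name": "Cy"},
]
problems = inconsistent_keys(sample_rows, "customer_id", "first_name")
print(problems)
```

With a Spark DataFrame, the rows could be obtained with something like `[r.asDict() for r in df.collect()]` for small test data sets. An empty result would mean the derived column is consistent with its base column.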

ronanstokes-db commented 1 year ago

A couple of comments:

1 - the FakerText integration is not deterministic at present, so running multiple times will produce different results

2 - I'll look at the phone number generation

ronanstokes-db commented 1 year ago

Overall, data is repeatable from run to run, with the exception of data generated by the Faker plugins and data generated from templates.

The following code will generate a pseudo first name that is repeated from run to run:

# `spark` and `display` are provided by the Databricks notebook environment
from pyspark.sql.types import IntegerType
import dbldatagen as dg
from dbldatagen import DataGenerator

rows = 10
partitions = 1

unique_customers = 2

generator = (DataGenerator(spark, name="demo", rows=rows, partitions=partitions, randomSeedMethod='hash_fieldname')
 .withIdOutput()
 .withColumn("customer_id", IntegerType(), uniqueValues=unique_customers, baseColumnType="hash")
 .withColumn("name", text=dg.ILText(words=(1, 1)))
 .withColumn("phone", template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd', base_column="customer_id")
)

df = generator.build()
display(df)

ronanstokes-db commented 1 year ago

Fixes for issues related to generation of template text values will be in the next release (v0.2.2).

ronanstokes-db commented 1 year ago

Fixes have been made for repeatable text data generation in a recent release.

Any Faker-based data generation will continue to be non-repeatable unless there are enhancements to the underlying Faker library.
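
For readers who want to see what "repeatable per base value" means for template text, here is a rough plain-Python sketch. It is not dbldatagen's implementation: the function `expand_template` is hypothetical, it only understands `d` (a digit) and `|` (alternative variants), and it seeds Python's `random` module with the base value as an assumed way of tying the output to customer_id.

```python
import random


def expand_template(template: str, seed: int) -> str:
    """Expand a simplified template using a generator seeded by `seed`.

    Hypothetical helper: handles only 'd' (random digit) and '|' (variants);
    all other characters are copied through unchanged.
    """
    rng = random.Random(seed)  # same seed -> same sequence of choices
    variant = rng.choice(template.split("|"))
    return "".join(str(rng.randint(0, 9)) if ch == "d" else ch for ch in variant)


template = r"(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd"

# One phone value per customer_id; repeated calls give the same value per key
phones = {cid: expand_template(template, cid) for cid in (0, 1)}
print(phones)
```

Seeding from the key rather than from the row position is the design choice that makes every row with the same customer_id receive the same phone value.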