Closed ericfeunekes closed 1 year ago
Thanks for your feedback
If the baseColumnType is hash, then the generated column for customer_id will compute a hash of the base column (by default, the column named id).
Use of the unique_values option will then apply a modulo operation to restrict the generated values to the range indicated by the number of possible unique values.
I'll review the generated data to see if the customer_ids are consistent with the above.
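To make the hash-then-modulo idea concrete, here is a minimal pure-Python sketch. It is not dbldatagen's implementation: the helper name hashed_id and the choice of SHA-256 are assumptions made only for illustration. It shows that hashing the base value and reducing it by modulo always yields the same bucket for the same id, and never more than unique_values distinct results.

```python
import hashlib


def hashed_id(base_value: int, unique_values: int) -> int:
    """Map a base column value to one of `unique_values` buckets.

    Hypothetical helper for illustration; the hash function actually
    used by the library may differ.
    """
    digest = hashlib.sha256(str(base_value).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce by modulo.
    return int.from_bytes(digest[:8], "big") % unique_values


# Ten base ids, restricted to at most two distinct customer ids.
customer_ids = [hashed_id(i, unique_values=2) for i in range(10)]
print(customer_ids)
```

Because the result depends only on the base value, rerunning the snippet produces the same list each time.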
I think that makes sense. But my issue is more with the first_name column. It uses base_column=customer_id, so I would expect that for each unique customer_id there should be one and only one first_name.
But that wasn't the case: even though there are only two unique customer_id values, first_name was randomly generated in every row.
It's the same for phone_number.
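One way to state this expectation precisely is as a check that every base value maps to exactly one derived value. The sketch below works on plain dictionaries so it runs without Spark; the function name inconsistent_keys and the sample rows are made up for illustration.

```python
from collections import defaultdict


def inconsistent_keys(rows, base_col, derived_col):
    """Return the base values that map to more than one derived value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[base_col]].add(row[derived_col])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Illustrative rows only; not actual generator output.
rows = [
    {"customer_id": 0, "first_name": "Ann"},
    {"customer_id": 1, "first_name": "Bob"},
    {"customer_id": 0, "first_name": "Cal"},  # violates the expectation
]
print(inconsistent_keys(rows, "customer_id", "first_name"))  # {0: ['Ann', 'Cal']}
```

For a small generated data set, something like [r.asDict() for r in df.collect()] should give suitable input; an empty result means each customer_id has a single first_name.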
A couple of comments:
1 - the FakerText integration is not deterministic at present, so running multiple times will produce different results
2 - I'll look at the phone number generation
Overall, data is repeatable from run to run, except for data generated by the Faker plugins and data generated from templates.
The following code will generate a pseudo first name that is repeated from run to run:
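import dbldatagen as dg
from pyspark.sql.types import IntegerType

# `spark` is the active SparkSession (predefined in Databricks notebooks)
rows = 10
partitions = 1
unique_customers = 2

generator = (dg.DataGenerator(spark, name="demo", rows=rows, partitions=partitions,
                              randomSeedMethod='hash_fieldname')
             .withIdOutput()
             # hash the id column, then restrict to `unique_customers` distinct values
             .withColumn("customer_id", IntegerType(), uniqueValues=unique_customers, baseColumnType="hash")
             # one-word pseudo name generated from lorem ipsum text
             .withColumn("name", text=dg.ILText(words=(1, 1)))
             .withColumn("phone", template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd', base_column="customer_id")
             )
df = generator.build()
display(df)  # Databricks notebook display helper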
Fixes for issues related to generation of template text values will be in the next release (v0.2.2).
Fixes have been made for repeatable text data generation in a recent release.
Any Faker-based data generation will continue to be non-repeatable unless there are enhancements to the underlying Faker library.
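As a rough illustration of what repeatable template output keyed on a base column could look like, the sketch below seeds a random generator from the base value before expanding the template. This is an assumption-laden toy, not the library's algorithm: expand_template is a hypothetical helper, and it only handles `|` as a separator between alternatives and `d` as a placeholder for a single digit.

```python
import random


def expand_template(template: str, base_value: int) -> str:
    """Expand a phone-style template, seeding all choices from `base_value`.

    Hypothetical helper: supports only `|` (alternatives) and `d` (one digit).
    """
    rng = random.Random(base_value)  # same base value -> same sequence of choices
    variant = rng.choice(template.split("|"))
    return "".join(str(rng.randint(0, 9)) if ch == "d" else ch for ch in variant)


template = "(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd"

# Repeated customer ids produce the same phone number every time.
phones = {cid: expand_template(template, cid) for cid in (0, 1, 0, 1)}
print(phones)
```

Seeding from the base value rather than from the row position is what ties the output to customer_id, so two rows with the same customer_id get the same phone number.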
Expected Behavior
When generating any data, if baseColumn is set to a reference column in withColumn, then the data generated for the new column should be the same whenever the value of the reference column is the same. For example, for the following code:
I would expect that there would be two first names and they would be consistent for the values in customer_id.
Current Behavior
With the example above, I get any number of random values in first_name and phone_number.
Steps to Reproduce (for bugs)
See the code above
Context
Trying to generate data with consistent values within a row.
Your Environment
dbldatagen version used: v0.2.0-rc0 public preview 2