databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
https://databrickslabs.github.io/dbldatagen

How to set template and min,max value for a nested schema attribute #229

Open galaxy79 opened 1 year ago

galaxy79 commented 1 year ago

Expected Behavior

I have a nested schema for the data set and want to set value template patterns for the attributes bankAcctId, bankProduct, storeGroup, association, merchantId, and terminalId using withColumnSpec to generate the synthetic data.

from pyspark.sql.types import StringType, StructField, StructType

my_schema = StructType(
    [
        StructField(
            "bank",
            StructType(
                [
                    StructField("bankAcctId", StringType()),
                    StructField("bankProduct", StringType()),
                ]
            ),
        ),
        StructField(
            "merchDetails",
            StructType(
                [
                    StructField("storeGroup", StringType()),
                    StructField("association", StringType()),
                    StructField("merchantId", StringType()),
                    StructField(
                        "terminal",
                        StructType(
                            [
                                StructField("terminalId", StringType()),
                                StructField("cardholderActivatedTerm", StringType()),
                                StructField(
                                    "posInteractionTerminalEntryMode", StringType()
                                ),
                            ]
                        ),
                    ),
                ]
            ),
        ),
    ]
)

I tried the code snippet below to build the synthetic data:

import dbldatagen as dg

testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count, partitions=4)
    .withIdOutput()
    .withSchema(my_schema)
)

testDataSpec = (
    testDataSpec.withColumnSpec("bank.bankAcctId", template=r"\\n-\\n")
    .withColumnSpec("merchDetails.storeGroup", template=r"\\n-\\n")
)
dfTestData = testDataSpec.build()

The code execution failed with this error:

dbldatagen.utils.DataGenError: DataGenError(msg=' column `bank.bankAcctId` must refer to defined column', baseException=None)

I am looking for some direction or an example of how to use it.

Your Environment

Running it on a Mac M1 Pro (macOS Ventura 13.5).

ronanstokes-db commented 11 months ago

Hi

To specify how data is generated for nested structures, create temporary fields, generate the values for them, and then combine the generated fields into the desired structure. You can't refer to a nested field in the data generation rules at present.

See the following documentation page for more information: https://databrickslabs.github.io/dbldatagen/public_docs/generating_json_data.html#generating-complex-column-data
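As a rough sketch of that approach (not a definitive implementation), the snippet below generates the `bank` fields as temporary flat columns marked `omit=True` and then combines them with a Spark SQL `named_struct` expression. The helper `named_struct_expr`, the function `build_bank_spec`, and the `bankProduct` value list are hypothetical names and values introduced only for illustration. The `omit`, `expr`, and `baseColumn` options are assumed to behave as described on the linked documentation page, and `dbldatagen` is assumed to accept a Spark DDL type string for the struct column.

```python
def named_struct_expr(fields):
    """Build a Spark SQL named_struct(...) expression from a nested dict.

    Keys are struct field names. A str value names the temporary flat
    column that supplies the field; a dict value describes a nested struct.
    (Hypothetical helper, not part of dbldatagen.)
    """
    parts = []
    for name, source in fields.items():
        value = named_struct_expr(source) if isinstance(source, dict) else source
        parts.append(f"'{name}', {value}")
    return "named_struct(" + ", ".join(parts) + ")"


# Struct field name -> temporary flat column that feeds it.
BANK_FIELDS = {"bankAcctId": "bankAcctId", "bankProduct": "bankProduct"}


def build_bank_spec(spark, row_count):
    """Return a data spec that emits a single nested `bank` column."""
    import dbldatagen as dg  # imported here so the helper above runs without Spark

    return (
        dg.DataGenerator(spark, name="test_data_set1", rows=row_count, partitions=4)
        # Temporary flat columns: generated, but omitted from the final output.
        .withColumn("bankAcctId", "string", template=r"\\n-\\n", omit=True)
        # Hypothetical value list, only for illustration.
        .withColumn("bankProduct", "string",
                    values=["checking", "savings"], random=True, omit=True)
        # Combine the temporary columns into the nested structure.
        .withColumn("bank", "struct<bankAcctId:string,bankProduct:string>",
                    expr=named_struct_expr(BANK_FIELDS),
                    baseColumn=list(BANK_FIELDS.values()))
    )
```

Calling `build_bank_spec(spark, row_count).build()` should then produce a DataFrame with a nested `bank` column. Deeper nesting such as `merchDetails.terminal` can be expressed by passing a nested dict to `named_struct_expr` and widening the struct type string to match.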

I'll update the documentation to provide some clearer examples of creating the data using an existing schema.