databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for testing, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.
https://databrickslabs.github.io/dbldatagen

Improve build ordering dependencies #158

Closed: ronanstokes-db closed this issue 1 year ago

ronanstokes-db commented 1 year ago

Expected Behavior

When a column is defined by a SQL expression that references other columns defined earlier in the specification, the build sequencing should be adjusted automatically so that those columns are created first.
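The sequencing described above could be sketched roughly as follows. This is only an illustration of the idea, not dbldatagen's actual implementation: the function name `assign_build_phases` and the shape of its inputs (a list of column names in definition order, plus a mapping from each column to the names its expression references) are assumptions made for the example.

```python
def assign_build_phases(column_order, references):
    """Assign each column a build phase so referenced columns are built first.

    column_order: column names in the order they were defined (hypothetical input).
    references:   dict mapping a column name to the names its SQL expression uses.
    """
    phases = {}
    for name in column_order:
        # Only references to columns defined earlier count as dependencies;
        # unknown names and forward references are ignored in this sketch.
        deps = [ref for ref in references.get(name, ()) if ref in phases]
        # A column with no dependencies lands in phase 0; otherwise it is
        # placed one phase after its latest dependency.
        phases[name] = 1 + max((phases[d] for d in deps), default=-1)
    return phases


column_order = ["id", "code", "total"]
references = {"code": {"id"}, "total": {"id", "code"}}
phases = assign_build_phases(column_order, references)
```

Columns that share a phase number do not depend on one another, so under this scheme they could be generated together in one step.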

Current Behavior

Dependencies on other columns have to be specified explicitly via the `baseColumn` attribute in all cases.

Improved behavior

If a simple identifier parser detects valid SQL identifiers in the SQL expression that a) are not inside a string literal and b) match an existing column name, generate the referencing column in a separate, later phase than the columns it references.
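A minimal sketch of such an identifier scan is shown below. It is an assumption about how the check might look, not the parser used in dbldatagen: the function name `referenced_columns` is hypothetical, matching is case-sensitive, and backtick-quoted identifiers and unterminated string literals are not handled.

```python
import re

# String literals are listed first so that, at a quote character, the whole
# literal is consumed and identifiers inside it are never reported.
_STRING_OR_IDENT = re.compile(
    r"'(?:[^'\\]|\\.)*'"                 # single-quoted string literal
    r'|"(?:[^"\\]|\\.)*"'                # double-quoted string literal
    r"|(?<!\w)(?P<ident>[A-Za-z_]\w*)"   # bare identifier, not mid-token
)


def referenced_columns(sql_expr, known_columns):
    """Return known column names used in sql_expr, in order of first appearance."""
    found = []
    for match in _STRING_OR_IDENT.finditer(sql_expr):
        ident = match.group("ident")
        # Keep only identifiers outside strings that name an existing column.
        if ident is not None and ident in known_columns and ident not in found:
            found.append(ident)
    return found


known = {"id", "first_name", "city"}
refs = referenced_columns("concat(first_name, ' id ', id)", known)
```

Function names such as `concat` are skipped here only because they do not match a known column name; a column that shares a name with a SQL function would be reported as a reference.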

This will reduce the number of cases where it is necessary to specify `baseColumn` explicitly.