databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
https://databrickslabs.github.io/dbldatagen

added changes for better build ordering when SQL expressions are pres… #159

Closed ronanstokes-db closed 1 year ago

ronanstokes-db commented 1 year ago

Proposed changes

Reduces the number of times an explicit baseColumn must be specified when a column definition references earlier columns inside its SQL expression clause.

It uses a simple identifier parser to detect potential references to earlier columns and adjusts how column building is separated into phases accordingly.
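As a rough illustration of the idea, the sketch below scans a SQL expression for identifier-like tokens and reports those that match already-defined column names. It is not the parser added in this PR: the function name `find_column_references`, the regular expressions, and the choice to skip quoted string literals are all assumptions made for the example.

```python
import re

# Hypothetical helper for illustration only; not dbldatagen's actual parser.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Single- or double-quoted string literals, allowing backslash escapes.
_STRING_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"")


def find_column_references(expr, known_columns):
    """Return known column names that appear as identifiers in `expr`.

    Names are returned in order of first appearance, without duplicates.
    Text inside quoted string literals is ignored so that a literal such
    as 'id' is not mistaken for a reference to a column named id.
    """
    without_literals = _STRING_LITERAL.sub(" ", expr)
    found = []
    for token in _IDENTIFIER.findall(without_literals):
        if token in known_columns and token not in found:
            found.append(token)
    return found


refs = find_column_references(
    "concat(first_name, ' ', last_name)",
    {"id", "first_name", "last_name"},
)
print(refs)
```

A scan like this only finds potential references: a function name or keyword that happens to match a column name would also be reported, which is why the result is best treated as a conservative hint for ordering rather than a full parse.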

Further comments

Each phase of column generation produces a select statement that generates the columns assigned to that phase. When a SQL expression refers to other columns, those columns must have been generated in a previous select statement.

This change adjusts the assignment of column generators to select statements so that this requirement is met.
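The constraint described above can be sketched as a small grouping routine. This is an assumed simplification, not the logic in `data_generator.py`: the function name `assign_build_phases` and the input shape (a list of `(name, references)` pairs in definition order) are hypothetical. A column stays in the current phase unless it references a column produced in that same phase, in which case a new phase (a new select statement) is started.

```python
def assign_build_phases(columns):
    """Group columns into ordered build phases.

    `columns` is a list of (name, references) pairs in definition order,
    where `references` lists the earlier columns the definition refers to.
    Each returned phase corresponds to one select statement; a column is
    never placed in the same phase as a column it references.
    """
    phases = []
    current = []
    for name, references in columns:
        # A reference to a column in the current phase forces a new phase,
        # because that column only exists after the current select runs.
        if any(ref in current for ref in references):
            phases.append(current)
            current = []
        current.append(name)
    if current:
        phases.append(current)
    return phases


plan = assign_build_phases([
    ("id", []),
    ("a", []),
    ("b", ["a"]),       # needs a, so it starts a second phase
    ("c", ["id"]),      # id is already available, so c joins b's phase
    ("d", ["b", "c"]),  # needs b and c, so it starts a third phase
])
print(plan)
```

Keeping columns in the current phase whenever possible keeps the number of select statements small, while the split on an in-phase reference guarantees that every referenced column exists before it is used.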

codecov[bot] commented 1 year ago

Codecov Report

Merging #159 (f23557b) into master (fefd904) will increase coverage by 0.17%. The diff coverage is 96.20%.

@@            Coverage Diff             @@
##           master     #159      +/-   ##
==========================================
+ Coverage   90.15%   90.33%   +0.17%     
==========================================
  Files          22       22              
  Lines        2439     2515      +76     
  Branches      396      416      +20     
==========================================
+ Hits         2199     2272      +73     
- Misses        157      158       +1     
- Partials       83       85       +2     
| Impacted Files | Coverage Δ |
|---|---|
| `dbldatagen/__init__.py` | 90.90% <ø> (ø) |
| `dbldatagen/utils.py` | 90.37% <90.00%> (-0.07%) :arrow_down: |
| `dbldatagen/data_generator.py` | 84.72% <96.87%> (+0.68%) :arrow_up: |
| `dbldatagen/schema_parser.py` | 97.67% <100.00%> (+0.41%) :arrow_up: |
