NYCPlanning / data-engineering-qaqc

streamlit app for data engineering
https://edm-data-engineering.nycplanningdigital.com

Customization of output data selection #255

Open fvankrieken opened 1 year ago

fvankrieken commented 1 year ago

Bit of a prospective issue, but wanted to start a discussion.

While working on the cpdb codebase, I wanted to QC the data being generated. QAQC is a bit rigid right now in how we compare versions for various repos. For cpdb, we choose a branch (which right now is actually just hard-coded to "main"), and then choose a historical version to compare to - this is also a hard-coded list at the moment.

For my testing, it would be great to be able to choose from a list of branches, perhaps sorted by recent activity (with main first), and to choose from the historical versions of that branch that are in DO. It would also help to pick a different branch for each file being compared, so that I could compare my dev branch's latest output to main's latest.
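A minimal sketch of the branch ordering described above, assuming the app already has a mapping from branch name to last commit time. The function name `order_branches` and the sample branch names are hypothetical, not taken from the QAQC codebase.

```python
from datetime import datetime


def order_branches(last_commit_by_branch, default_branch="main"):
    """Return branch names with the default branch first, then the rest
    sorted by most recent commit (newest first)."""
    others = sorted(
        (name for name in last_commit_by_branch if name != default_branch),
        key=lambda name: last_commit_by_branch[name],
        reverse=True,
    )
    # Only lead with the default branch if it is actually present.
    head = [default_branch] if default_branch in last_commit_by_branch else []
    return head + others


# Hypothetical sample data: branch name -> time of last commit.
sample = {
    "fvk-dev": datetime(2023, 5, 2),
    "main": datetime(2023, 4, 1),
    "old-experiment": datetime(2022, 11, 15),
    "hotfix": datetime(2023, 5, 10),
}
ordered = order_branches(sample)
print(ordered)  # ['main', 'hotfix', 'fvk-dev', 'old-experiment']
```

The resulting list could be passed as the options of a Streamlit `st.selectbox`, once for each side of the comparison, so the two sides can point at different branches.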

Gets a little more complex for other products - I know some comparisons rely more on pre-processing. But I'd love to flesh out what we'd want from slightly more dynamic options on a product-by-product basis. No rush, just wanted to get this down as a starting point.

fvankrieken commented 1 year ago

Related to #249, #242

fvankrieken commented 1 year ago

FacDB is an example of a db where we do this for the git branches, though it could be clearer about which data are being compared

damonmcc commented 1 year ago

definitely worth scoping this out to cover the two features of which branch's data to QA and which past data to compare it to

for now though, seems valuable to upgrade the approaches in some QAQC reports (e.g. DevDB) to at least not be hardcoded to certain branches and emulate the FacDB approach of using the list of repo branches
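One possible way to replace a hard-coded branch with the list of repo branches is sketched below. It assumes the list comes from the GitHub REST API's "list branches" endpoint; how FacDB actually gets its branch list is not shown in this thread, and the helper names `branch_names` and `fetch_branch_names` are hypothetical.

```python
import json
import urllib.request


def branch_names(payload):
    """Extract branch names from a GitHub 'list branches' response,
    which is a JSON array of objects that each carry a 'name' field."""
    return [item["name"] for item in payload]


def fetch_branch_names(owner, repo):
    """Fetch up to the first 100 branch names of a public repo.
    Needs network access; pagination and auth are left out of this sketch."""
    url = f"https://api.github.com/repos/{owner}/{repo}/branches?per_page=100"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return branch_names(json.load(response))


# Offline demonstration using a hand-written payload in the same shape.
sample_payload = json.loads('[{"name": "main"}, {"name": "dev"}]')
names = branch_names(sample_payload)
print(names)  # ['main', 'dev']
```

Keeping the parsing separate from the network call means a report can be tested without hitting the API, and a repo with more than 100 branches would need pagination added.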