NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team
14 stars 0 forks source link

Persistent database for all builds #17

Closed damonmcc closed 7 months ago

damonmcc commented 1 year ago

task list

an issue for brainstorming

with the understanding that we like the idea of a single Postgres DB to serve as our data warehouse for building all data products, some thoughts:

potential data warehouse / build engine design

plan for trying this out

damonmcc commented 1 year ago

follow-up thoughts on testing:

fvankrieken commented 1 year ago

Could also have deletion of temp dbs as a github action triggered by PRs being merged based on branch name

damonmcc commented 1 year ago

so far the thoughts above are all about the Transform part of ELT. some thoughts on the Load part:

fvankrieken commented 1 year ago

I think this is beyond what we want in a first iteration but just food for thought

https://lakefs.io/

Just thinking about having tables by versions vs something that's not so redundant in storage like having valid_to and valid_from columns. This is sort of another step past that. I like valid_to and valid_from for some reasons but it requires a fair amount of work on our end in terms of how we make sure we're updating safely and then QOL when querying is a little lower - have to more manually specify within the queries

fvankrieken commented 1 year ago

And like your rundown there, those last two bullets especially.

One other thought on general structure - we could have the production dataset schemas, which would be something like schema = dataset name and tablename = date (or latest. Or maybe we skip latest and have a log of the latest versions per schema elsewhere). That seems to be the only way to maintain our current rigorous versioning without doing anything too crazy. But then we could have build schemas rather than dbs - create a schema based on a given branch or something along those lines, or based a github action id/etc for cloud-based "production" builds. Keep all of the intermediate tables within this schema, final step could be copy final table out to a more standardized place.

I really wish postgres just had one more level of hierarchy.

fvankrieken commented 1 year ago

But I really agree with the point that build (transform) should be separated from EL, and if anything should really just be checking the state of things (and maybe able to kick off some E/L if it can definitively tell that data is out of date somehow) and running from there

fvankrieken commented 1 year ago

And like your rundown there, those last two bullets especially.

One other thought on general structure - we could have the production dataset schemas, which would be something like schema = dataset name and tablename = date (or latest. Or maybe we skip latest and have a log of the latest versions per schema elsewhere). That seems to be the only way to maintain our current rigorous versioning without doing anything too crazy. But then we could have build schemas rather than dbs - create a schema based on a given branch or something along those lines, or based a github action id/etc for cloud-based "production" builds. Keep all of the intermediate tables within this schema, final step could be copy final table out to a more standardized place.

I really wish postgres just had one more level of hierarchy.

And with that, we could maybe get away with a "production" db that only deployed pipelines (ghas from main for now) are run on and a "staging" db (not staging in dbt parlance but more just a "testing" db) that any non-main/deployed things are being run on. Theoretically could maybe get away with 1 db but that doesn't seem like a good/safe long-term solution

damonmcc commented 10 months ago

thoughts per brainstorm on 7/17

damonmcc commented 9 months ago

noting this should be revised for new draft/publish

damonmcc commented 8 months ago

per roadmap chat

damonmcc commented 8 months ago

expected changes to relevant builds

expected changes to common build machinery and a new scheduled action