Persistent database for all builds - Githubissues

NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Persistent database for all builds #17

Closed damonmcc closed 11 months ago

damonmcc commented 1 year ago

task list

[x] #192
[x] #269
[x] #307

an issue for brainstorming

with the understanding that we like the idea of a single Postgres DB to serve as our data warehouse for building all data products, some thoughts:

the current DB named edm-data is persistent and hosted on Digital Ocean. happy to use it
in Postgres, the hierarchy of "database objects" is Server -> Databases -> Schemas -> Tables
cross-database queries are not possible natively (only through extensions such as dblink or postgres_fdw)
potential design:
- three databases: Test, Staging, Production
- test builds would target the Test DB (local, CI)
- normal builds would target the Staging DB (local, CI)
- staged data could be "promoted" to the Production DB by copying it (via pg_dump)
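
The "promote by copying" step above could be sketched as a pg_dump piped into psql. This is only a sketch under assumptions, not the team's implementation: the helper name `promotion_commands`, the database names `staging` and `production`, and the schema name `pluto` are hypothetical, and host and credentials are assumed to come from the usual libpq environment variables. The code only builds the two commands; it does not run them.

```python
import shlex


def promotion_commands(schema: str, source_db: str, target_db: str) -> tuple[list[str], list[str]]:
    """Build (but do not run) a pg_dump | psql pair that copies one schema.

    Hypothetical helper: connection details are assumed to come from
    libpq environment variables such as PGHOST and PGUSER.
    """
    # dump only the named schema, without ownership statements
    dump = ["pg_dump", f"--dbname={source_db}", f"--schema={schema}", "--no-owner"]
    # ON_ERROR_STOP makes psql exit on the first failed statement
    restore = ["psql", f"--dbname={target_db}", "--set=ON_ERROR_STOP=1"]
    return dump, restore


dump, restore = promotion_commands("pluto", "staging", "production")
print(shlex.join(dump), "|", shlex.join(restore))
```

Keeping the commands as argument lists means they can be passed to `subprocess` without shell quoting issues if this were ever wired into a build step.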

potential data warehouse / build engine design

use 3 levels of DBs to isolate dev and production data
the highest level would be the "endpoint" for both exporting and querying build outputs. all data there would have been "promoted" after QAQC
record build details when writing to a DB
Test
- meant to be isolated versions of Staging
- used by default for non-release builds
- structure
  - permutations of Staging schemas (with prefixes based on the environment)
Staging
- meant to be the production build engine
- used for release builds
- structure
  - a schema for all source tables and a table for source details
  - a schema for product details and build details
  - a schema for each product with all releases
Publishing
- populated by copying from Staging
- structure
  - a schema for each product

plan for trying this out

document desired database structure(s)
document high-level build sequence needed to use data warehouse
change a data product to use this (db-template?)
change at least one more data product in order to test using common source data (db-zoningtaxlots, db-checkbook?)

damonmcc commented 1 year ago

follow-up thoughts on testing:

we probably don't want a lot of test databases
to allow isolated testing for multiple developers/PRs in the same repo:
- maybe create and name schemas in the Test DB to have a prefix/suffix unique to the source of the testing
- e.g. test.pluto_pr_80, test.pluto_workflow_dispatch_branch_name, test.ztl_local_damon
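
A naming scheme like the examples above could be generated rather than typed by hand. The sketch below is one hypothetical way to do it: the function name `build_schema_name` and the sanitizing rules are assumptions, not an agreed convention. The one firm constraint is that Postgres truncates identifiers longer than 63 bytes.

```python
import re

MAX_IDENTIFIER_BYTES = 63  # Postgres truncates longer identifiers


def build_schema_name(product: str, source: str) -> str:
    """Combine a product and a test source (PR, branch, local user) into a schema name.

    Hypothetical helper: lowercases, replaces runs of other characters
    with underscores, and trims to the Postgres identifier limit.
    """
    raw = f"{product}_{source}".lower()
    # keep only lowercase letters, digits, and underscores
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # the result is ASCII, so characters and bytes are the same count
    return name[:MAX_IDENTIFIER_BYTES]


print(build_schema_name("pluto", "pr_80"))
print(build_schema_name("ztl", "local-damon"))
```

Long branch names that share a prefix could collide after truncation, so a real version might append a short hash of the full name.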

fvankrieken commented 1 year ago

Could also have deletion of temp dbs as a github action triggered by PRs being merged based on branch name

damonmcc commented 1 year ago

so far the thoughts above are all about the Transform part of ELT. some thoughts on the Load part:

reminder that cross-DB queries aren't possible. so if source data isn't already in a Transformation DB (e.g. edm-data.staging), it'd have to be copied into it before a build. maybe we like that? the postgres utility pg_dump would work!
to simplify:
- Extract = find the latest flat file or hit an API and load to DB
- Load = put the results of an extract into the data warehouse
edm-data.recipe is an existing example of source data in our data warehouse. data can currently be loaded to it via data-library, but none of our primary datasets use it (they use sql dump files in S3 and often load them into temporary DBs)
the current structure of edm-data.recipe is generally:
- schema: dataset_name
  - table: version_by_date
- schema: public
  - table: dataset_name OR datasource
highlighting that datasource table because it has metadata about each dataset: dataset_name, description, date_of_update, date_downloaded
starting a build probably shouldn't include the Load step
- if it does, would we end up with a lot of redundant loading?
- aren't the cadences of source data updates independent of any single product's build cadence?
- pursuing single-responsibility?
in the data warehouse, we would probably like having multiple dated versions of the same source data available to use in builds
I like the value of something like the datasource table to document and query details about our source data
a build (and/or QAQC) should probably be able to check the "freshness" of source data. at the least, that'd be documented along with the build output. and a build could warn about stale source data and maybe not even build based on that
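
The freshness check described in the last bullet could look something like the sketch below. It assumes metadata shaped like the datasource table (dataset_name and date_downloaded are taken from the fields listed above); the `SourceRecord` class, the `stale_sources` function, the sample rows, and the 30-day threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SourceRecord:
    """One row of source metadata, modeled loosely on the datasource table."""
    dataset_name: str
    date_downloaded: date


def stale_sources(records, today, max_age=timedelta(days=30)):
    """Return names of datasets downloaded more than `max_age` before `today`."""
    return [r.dataset_name for r in records if today - r.date_downloaded > max_age]


# made-up sample metadata for illustration only
records = [
    SourceRecord("dataset_a", date(2023, 7, 1)),
    SourceRecord("dataset_b", date(2023, 3, 1)),
]
stale = stale_sources(records, today=date(2023, 7, 17))
if stale:
    print("warning: stale source data:", ", ".join(stale))
```

A build could log this list next to its outputs, or refuse to run when the list is non-empty, depending on how strict we want to be.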

fvankrieken commented 1 year ago

I think this is beyond what we want in a first iteration but just food for thought

https://lakefs.io/

Just thinking about having tables by version vs. something that's less redundant in storage, like having valid_to and valid_from columns. This (lakeFS) is sort of another step past that. I like valid_to and valid_from for some reasons, but it requires a fair amount of work on our end to make sure we're updating safely, and QOL when querying is a little lower since the validity range has to be specified manually within the queries.
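
To make the querying trade-off concrete, here is a small sketch of the valid_from / valid_to pattern. It uses Python's built-in sqlite3 in place of Postgres so that it runs anywhere, and the table name, columns, and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zoning (lot TEXT, district TEXT, valid_from TEXT, valid_to TEXT)"
)
conn.executemany(
    "INSERT INTO zoning VALUES (?, ?, ?, ?)",
    [
        ("lot_a", "R6", "2023-01-01", "2023-06-01"),  # superseded row
        ("lot_a", "R7", "2023-06-01", None),          # current row
        ("lot_b", "C4", "2023-01-01", None),          # never changed
    ],
)


def as_of(conn, day):
    """Return rows valid on `day` (ISO date string), using [valid_from, valid_to)."""
    query = """
        SELECT lot, district FROM zoning
        WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)
        ORDER BY lot
    """
    return conn.execute(query, (day, day)).fetchall()


print(as_of(conn, "2023-03-15"))
print(as_of(conn, "2023-07-01"))
```

Only the changed row for lot_a is stored twice, which is the storage win. The cost is that every query needs the date predicate, and every update has to close the old row and open a new one safely.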

fvankrieken commented 1 year ago

And like your rundown there, those last two bullets especially.

One other thought on general structure - we could have the production dataset schemas, which would be something like schema = dataset name and tablename = date (or latest. Or maybe we skip latest and have a log of the latest versions per schema elsewhere). That seems to be the only way to maintain our current rigorous versioning without doing anything too crazy. But then we could have build schemas rather than dbs - create a schema based on a given branch or something along those lines, or based on a github action id/etc for cloud-based "production" builds. Keep all of the intermediate tables within this schema, final step could be to copy the final table out to a more standardized place.

I really wish postgres just had one more level of hierarchy.

fvankrieken commented 1 year ago

But I really agree with the point that build (transform) should be separated from EL, and if anything should really just be checking the state of things (and maybe able to kick off some E/L if it can definitively tell that data is out of date somehow) and running from there

fvankrieken commented 1 year ago

And with that, we could maybe get away with a "production" db that only deployed pipelines (ghas from main for now) are run on and a "staging" db (not staging in dbt parlance but more just a "testing" db) that any non-main/deployed things are being run on. Theoretically could maybe get away with 1 db but that doesn't seem like a good/safe long-term solution

damonmcc commented 1 year ago

thoughts per brainstorm on 7/17

all of our pipelines currently
- use SQL, python, or both to transform source data
a short-term build DB is distinct from any long-term data storage
- storage: source data, build outputs
- build engine: intermediate and final build tables
- builds often use temporary file storage as well as a DB
using postgres will be the least disruptive as opposed to BigQuery or any other types of DB
when a DB is currently used, there's no schema specified so queries use the default schema (e.g. select * from table_name resolves to public.table_name)
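
Because existing queries don't name a schema, one low-disruption way to point them at a build schema would be to set the Postgres search_path at the start of a session. The sketch below only builds the statement; the helper name `search_path_statement` and the schema name are hypothetical. Keeping public on the path is an assumption that extensions such as postgis are installed there.

```python
def search_path_statement(build_schema: str) -> str:
    """Build a SET search_path statement that puts the build schema first.

    Hypothetical helper: unqualified table names then resolve to the
    build schema before falling back to public.
    """
    # double any embedded quotes so the identifier is safely quoted
    quoted = '"' + build_schema.replace('"', '""') + '"'
    return f"SET search_path TO {quoted}, public"


print(search_path_statement("pluto_pr_80"))
```

With this in place, a query like select * from table_name would not need editing to run inside a branch-specific schema.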

damonmcc commented 1 year ago

noting this should be revised for new draft/publish

damonmcc commented 11 months ago

per roadmap chat

let's keep scope limited to
- not using the same DB for every product
- not using a common DB for source data (though one would avoid duplicative/repetitive loading in each build)
will be nice to change all products to use a DB in the cluster to build
should delete schemas on a regular basis
- if the branch/PR has been deleted/merged (aka can't be found when checked), drop the schema
- weekly or daily? kinda prefer weekly
should probably delete very old things in the edm-data server first (e.g. recipes DB)
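
The scheduled cleanup described above could be sketched as follows. It assumes each build schema is named exactly after its branch, which is a hypothetical convention. The function names and the `protected` default are also assumptions, and listing the schemas and branches (e.g. from the DB catalog and the GitHub API) is left out.

```python
def schemas_to_drop(build_schemas, active_branches, protected=("public",)):
    """Return schemas whose branch can no longer be found, in sorted order.

    Hypothetical helper: assumes schema name == branch name.
    """
    active = set(active_branches)
    return sorted(s for s in build_schemas if s not in active and s not in protected)


def drop_statements(schemas):
    """Build one DROP SCHEMA statement per schema, quoting each identifier."""
    statements = []
    for schema in schemas:
        quoted = '"' + schema.replace('"', '""') + '"'
        # CASCADE also drops the tables inside the schema
        statements.append(f"DROP SCHEMA IF EXISTS {quoted} CASCADE")
    return statements


orphans = schemas_to_drop(
    ["public", "pluto_pr_80", "ztl_main", "old_branch"],
    active_branches=["ztl_main"],
)
for statement in drop_statements(orphans):
    print(statement)
```

Separating "decide what to drop" from "build the statements" makes it easy to run the first half as a dry run before a weekly job actually deletes anything.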

damonmcc commented 11 months ago

expected changes to relevant builds

manually create a build DB for each product with the extensions postgis and fuzzystrmatch
use a branch-specific schema in the product's build DB
replace all use of a temporary github DB with the persistent edm-data postgres server
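
The manual setup in the first bullet amounts to a few SQL statements per product. The sketch below generates them; the helper name `build_db_setup` and the example database name are hypothetical. Note that CREATE EXTENSION has to run while connected to the new database, so the statements are grouped by where they should be run.

```python
EXTENSIONS = ("postgis", "fuzzystrmatch")


def build_db_setup(db_name: str) -> dict[str, list[str]]:
    """Return setup SQL grouped by the connection it should run on.

    Hypothetical helper: "server" statements run from any existing
    database, "database" statements run inside the new build DB.
    """
    # double any embedded quotes so the identifier is safely quoted
    quoted = '"' + db_name.replace('"', '""') + '"'
    return {
        "server": [f"CREATE DATABASE {quoted}"],
        "database": [f"CREATE EXTENSION IF NOT EXISTS {ext}" for ext in EXTENSIONS],
    }


setup = build_db_setup("db-pluto")
for target, statements in setup.items():
    for statement in statements:
        print(f"[{target}] {statement}")
```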

expected changes to common build machinery and a new scheduled action

delete draft build schema and exports in two situations:
- when a new draft build starts
- periodically to reclaim storage space (if relevant branch doesn't exist, delete. weekly?)