Closed damonmcc closed 11 months ago
follow-up thoughts on testing:
- temp DBs like `test.pluto_pr_80`, `test.pluto_workflow_dispatch_branch_name`, `test.ztl_local_damon`
- could also have deletion of temp DBs as a GitHub Action triggered by PRs being merged, based on branch name
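The merge-triggered cleanup idea could look something like the sketch below: derive the temp DB name from the branch name, then emit the statement a GitHub Action would run. This assumes a `test.<product>_<branch>` naming convention like the examples above; the function names are hypothetical.

```python
# Sketch of the temp-DB cleanup idea, assuming names like "test.pluto_pr_80".
# Function names and the normalization rule are assumptions, not existing code.
import re

def temp_db_name(product: str, branch: str) -> str:
    """Build the temp DB name a PR branch would have used."""
    # Normalize the branch name the way a CI job might (non-alphanumerics -> "_").
    safe_branch = re.sub(r"[^a-zA-Z0-9]", "_", branch)
    return f"test.{product}_{safe_branch}"

def drop_statement(db_name: str) -> str:
    """SQL a merge-triggered GitHub Action could run to delete the temp DB."""
    return f'DROP DATABASE IF EXISTS "{db_name}";'

print(drop_statement(temp_db_name("pluto", "pr_80")))
# prints: DROP DATABASE IF EXISTS "test.pluto_pr_80";
```

The action itself would just run this statement against the warehouse on a `pull_request: closed` event when the PR was merged.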
so far the thoughts above are all about the Transform part of ELT. some thoughts on the Load part:

`edm-data.staging`: data would have to be copied into it before a build. maybe we like that? the postgres utility `pg_dump` would work!

`edm-data.recipe` is an existing example of source data in our data warehouse. data can currently be loaded to it via data-library, but none of our primary datasets use it (they use sql dump files in S3 and often load them into temporary DBs).

`edm-data.recipe` is generally:
- `dataset_name.version_by_date`
- `public.dataset_name` OR `datasource`

the `datasource` table is nice because it has metadata about each dataset: `dataset_name`, `description`, `date_of_update`, `date_downloaded`. we could use a `datasource` table to document and query details about our source data. I think this is beyond what we want in a first iteration, but just food for thought
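A minimal sketch of that `datasource` metadata table, using sqlite3 only as a self-contained stand-in for Postgres; the example row is hypothetical, and only the four columns named above are taken from the discussion.

```python
# Minimal sketch of the proposed `datasource` metadata table.
# sqlite3 is a stand-in for Postgres so the example runs anywhere;
# the inserted row is made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE datasource (
        dataset_name TEXT PRIMARY KEY,
        description TEXT,
        date_of_update TEXT,
        date_downloaded TEXT
    )
    """
)
conn.execute(
    "INSERT INTO datasource VALUES (?, ?, ?, ?)",
    ("dcp_mappluto", "hypothetical example row", "2023-06-01", "2023-06-05"),
)
# "query details about our source data", e.g. everything updated since a date:
rows = conn.execute(
    "SELECT dataset_name FROM datasource WHERE date_of_update >= '2023-01-01'"
).fetchall()
print(rows)  # [('dcp_mappluto',)]
```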
Just thinking about having tables by version vs. something less redundant in storage, like having `valid_to` and `valid_from` columns. This is sort of another step past that. I like `valid_to` and `valid_from` for some reasons, but it requires a fair amount of work on our end in terms of making sure we're updating safely, and then QOL when querying is a little lower: you have to specify the version more manually within the queries.
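The querying-QOL cost mentioned above can be sketched concretely. Again sqlite3 stands in for Postgres, and the table and values are hypothetical; the point is that every query has to carry the `valid_from`/`valid_to` filter.

```python
# Sketch of the valid_from / valid_to alternative to per-version tables.
# sqlite3 stands in for Postgres; table and values are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id INTEGER, value TEXT, valid_from TEXT, valid_to TEXT);
    -- Two versions of the same row; the current one has valid_to = NULL.
    INSERT INTO dataset VALUES (1, 'old', '2022-01-01', '2023-01-01');
    INSERT INTO dataset VALUES (1, 'new', '2023-01-01', NULL);
    """
)
# The QOL cost: every query must manually specify which version it wants.
as_of = "2023-06-01"
row = conn.execute(
    """
    SELECT value FROM dataset
    WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)
    """,
    (as_of, as_of),
).fetchone()
print(row)  # ('new',)
```

With per-version tables, the same read is just `select * from dataset.version_by_date`, which is the trade-off being weighed.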
And like your rundown there, those last two bullets especially.
One other thought on general structure: we could have the production dataset schemas, something like schema = dataset name and table name = date (or `latest`; or maybe we skip `latest` and keep a log of the latest versions per schema elsewhere). That seems to be the only way to maintain our current rigorous versioning without doing anything too crazy. But then we could have build schemas rather than build DBs: create a schema based on a given branch or something along those lines, or based on a GitHub Action run id/etc. for cloud-based "production" builds. Keep all of the intermediate tables within this schema; the final step could be copying the final table out to a more standardized place.
I really wish postgres just had one more level of hierarchy.
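The build-schema idea above could be sketched roughly as below: derive a schema name from a branch or run id, then copy the final table out to the schema-per-dataset, table-per-date layout. All names here are hypothetical assumptions, not existing conventions.

```python
# Sketch of the per-branch build-schema idea; every name is hypothetical.
import re

def build_schema(ref: str) -> str:
    """Schema name for a build, derived from a branch name or GHA run id."""
    return "build_" + re.sub(r"[^a-zA-Z0-9]", "_", ref)

def publish_sql(schema: str, table: str, version: str) -> str:
    """Copy the final table out of the build schema to the standardized place
    (schema = dataset name, table name = version date)."""
    return (
        f'CREATE TABLE "{table}"."{version}" AS '
        f'SELECT * FROM "{schema}"."{table}";'
    )

schema = build_schema("feature/new-sources")
print(schema)  # build_feature_new_sources
print(publish_sql(schema, "pluto", "2023-06-01"))
```

Dropping the whole build schema afterwards would also clean up all the intermediate tables in one step.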
But I really agree with the point that build (Transform) should be separated from EL; if anything, the build should really just be checking the state of things (and maybe able to kick off some E/L if it can definitively tell that data is out of date somehow) and running from there.
And with that, we could maybe get away with a "production" DB that only deployed pipelines (GHAs from main, for now) run against, and a "staging" DB (not staging in dbt parlance, more just a "testing" DB) that any non-main/non-deployed things run against. Theoretically we could maybe get away with one DB, but that doesn't seem like a good/safe long-term solution.
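That routing decision is simple enough to sketch in a few lines; the function and the "deployed" flag are hypothetical, just making the rule above explicit.

```python
# Sketch of routing builds to "production" vs "staging" DBs;
# the function name and inputs are assumptions for illustration.
def target_database(git_ref: str, is_deployed_action: bool) -> str:
    """Only deployed pipelines running from main touch production."""
    if git_ref == "main" and is_deployed_action:
        return "production"
    return "staging"  # any non-main / non-deployed run

print(target_database("main", True))       # production
print(target_database("feature/x", False)) # staging
```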
(e.g. so users can just `select * from public.table_name`). noting this should be revised for the new draft/publish per roadmap chat
task list:
- expected changes to relevant builds
- expected changes to common build machinery and a new scheduled action
- an issue for brainstorming
with the understanding that we like the idea of a single Postgres DB to serve as our data warehouse for building all data products, some thoughts:
- `edm-data` is persistent and hosted on Digital Ocean. happy to use it
- `Test`, `Staging`, `Production` (`pg_dump`)
- potential data warehouse / build engine design
- plan for trying this out