NYCPlanning / db-cpdb

Capital Projects Database
https://nycplanning.github.io/db-cpdb

Export source data versions #135

Closed fvankrieken closed 1 year ago

fvankrieken commented 1 year ago

Closes #131

#131 references pluto as an example, but I went with a different approach for now. Open to discuss pros/cons/alternate approaches.

In pluto, there's a sql file that gets called via psql at time of data export, selecting the version number from each temporary sql table and dumping them into a csv. The pro of this is that it happens at the same time as all the other data exporting, so the steps of the pipeline stay separated into logical groupings.
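For concreteness, a sketch of what that export-time query could look like (the table names and the version column "v" below are assumptions for illustration, not the actual pluto schema):

```sql
-- Sketch of a pluto-style export query: read the version from each temporary
-- source table and union the results into one csv-ready result set.
-- Table names and the version column "v" are illustrative assumptions.
SELECT 'dcp_mappluto_wi' AS datasource, MAX(v) AS version FROM dcp_mappluto_wi
UNION ALL
SELECT 'fisa_capitalcommitments' AS datasource, MAX(v) AS version FROM fisa_capitalcommitments;
-- psql would then dump the result to a file, e.g.
--   psql <connection> -c "\copy (<query above>) TO 'output/source_data_versions.csv' CSV HEADER"
```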

The downside is that it's a fair amount of code to query each version number explicitly, one at a time, when there's already a step of the pipeline where we explicitly get the version number for every imported data source - at time of data loading/import. So here, I went ahead and just built that csv at time of import.

fvankrieken commented 1 year ago

lgtm! and makes sense to me to document versions at import rather than at export

A small con of switching approaches is that there'd no longer be a SQL table with data source versions. But this pipeline doesn't expect persistent tables and doesn't use data source versions during the build, so it's hardly a con at all.

I guess the longer-term solution then might be to create a sql file at time of import, and then just dump the whole table at export. Still fairly concise both code- and logic-wise, but it keeps a similar separation of logic by step and would also translate to an environment where the db is persisted.

fvankrieken commented 1 year ago

As in during import, instead of dumping to the output csv, dump to a sql file with insert into table values ((map_pluto_wi, 22v3), ...)
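A minimal sketch of what that generated sql file could contain, using the example values above (the table and column names are assumptions; the repo may name them differently):

```sql
-- Hypothetical contents of the sql file written at import time.
CREATE TABLE IF NOT EXISTS source_data_versions (
    datasource text,
    version text
);
-- one insert appended per imported data source; values taken from the example above
INSERT INTO source_data_versions (datasource, version) VALUES
    ('map_pluto_wi', '22v3');
```

At export, the whole table could then be dumped to csv in one step.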

damonmcc commented 1 year ago

As in during import, instead of dumping to the output csv, dump to a sql file with insert into table values ((map_pluto_wi, 22v3), ...)

I like it. I think terminology is confusing me though, because I think we're describing:

  • inserting into a sql table
  • dumping that table to a csv file at export

fvankrieken commented 1 year ago

I like it. I think terminology is confusing me though, because I think we're describing:

  • inserting into a sql table
  • dumping that table to a csv file at export

That is what I mean, yeah

fvankrieken commented 1 year ago

I think the latest changes make sense. In addition to creating the table with this info, it adds the version column to each table as in pluto (in the exported files as well as in the temp db). This is a decent tweak to the output data so let's confirm with @AmandaDoyle on Monday this is what she was thinking with #131, or if she'd prefer just the output file. I can see the argument either way

fvankrieken commented 1 year ago

I think the latest changes make sense. In addition to creating the table with this info, it adds the version column to each table as in pluto (in the exported files as well as in the temp db). This is a decent tweak to the output data so let's confirm with @AmandaDoyle on Monday this is what she was thinking with #131, or if she'd prefer just the output file. I can see the argument either way

What am I talking about - these are the import tables, not the export. Anyways, it still depends on whether we think it's useful to have the version number in the tables as its own column.
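For clarity, having the version number as its own column would amount to something like this at import time (table, column, and value are illustrative assumptions):

```sql
-- Sketch only: stamp an imported source table with its data version.
ALTER TABLE map_pluto_wi ADD COLUMN v text;
UPDATE map_pluto_wi SET v = '22v3';
```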

fvankrieken commented 1 year ago

So @AmandaDoyle, the main difference right now between here and pluto is that for pluto, at time of import the version is only added to each source table, while here it's added both to each source data table and to a specific versions table, source_data_versions; then at export that table is just dumped.
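So the export step for versions reduces to dumping that one table, roughly along these lines (the output path and connection are assumptions):

```sql
-- Sketch of the export step: dump the whole versions table built at import.
COPY source_data_versions TO STDOUT WITH CSV HEADER;
-- e.g. psql <connection> -c "COPY source_data_versions TO STDOUT CSV HEADER" > output/source_data_versions.csv
```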

damonmcc commented 1 year ago

@fvankrieken although it's unlikely to break anything downstream (e.g. the Capital Planning Explorer), having a version column in every table seems valuable only when data from multiple versions frequently occupy the same database.

I doubt it's adding much repetitive data since there aren't a ton of records in these tables, so no worries if you wanna keep the columns.