NYCPlanning / db-data-library

📚 Data Library
https://nycplanning.github.io/db-data-library/library/index.html
MIT License

fix archiving to postgres #392

Closed damonmcc closed 1 year ago

damonmcc commented 1 year ago

changes

fvankrieken commented 1 year ago

Shouldn't this only have the change here? This is going to change output folder paths in DO. Not sure if we currently have any that output versions with capital letters, but still think it's best to only do this specifically for the table name and nowhere else

fvankrieken commented 1 year ago

Or if for these cases, you want to change the version name entirely, still only do it for cases where the output is postgres and not others.

Ahh but then we run into inconsistencies if we write to both csv and postgres. hmm

fvankrieken commented 1 year ago

What specific dataset is this needed for?

damonmcc commented 1 year ago

@fvankrieken good call that this only has to apply to the sql statements

and I don't think there'd be any issue with inconsistent casing between csv and postgres (other than we'd probably prefer consistency)

this came up when archiving new adult use input data and I wrote capital letters in the version (20230426_noM2M3)
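
A minimal sketch of the narrower fix discussed above: lowercase the version only where the table name is built for SQL, and leave output folder paths untouched. The function names `sql_table_name` and `output_folder` and the dataset name `example_dataset` are hypothetical and not taken from data library. The sketch assumes the underlying problem is Postgres folding unquoted identifiers to lowercase, so a table created under a mixed-case name is not found by a later unquoted reference.

```python
def sql_table_name(version: str) -> str:
    """Build a quoted Postgres identifier from a version string.

    Hypothetical helper: lowercases the version so that quoted and
    unquoted references resolve to the same table, and doubles any
    embedded double quotes, which is how Postgres escapes them.
    """
    return '"' + version.lower().replace('"', '""') + '"'


def output_folder(dataset: str, version: str) -> str:
    """Build an output folder path, keeping the version casing as given.

    Hypothetical helper: file outputs are deliberately left unchanged
    so existing folder paths are not affected.
    """
    return f"{dataset}/{version}"


version = "20230426_noM2M3"
print(sql_table_name(version))
print(output_folder("example_dataset", version))
```

With this split, only the SQL identifier is normalized, which matches the suggestion to apply the change to the table name and nowhere else.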

mbh329 commented 1 year ago

Why was the noM2M3 making it into the table? I'm assuming you just added it by mistake when specifying the version and this created the issue? We should keep the convention where we pass the version just as YYYYMMDD without any additional characters

damonmcc commented 1 year ago

I did it on purpose since there are indeed multiple versions of the source data that don't only vary by the date they were acquired; they were also generated with different tax lot exclusion criteria (20230425 vs 20230426_nom2m3). the other two options seemed to be:

  1. archive the new data as 20230426

    • since outputs based on 20230425 and 20230426 source data were both needed by GIS, the script had to be run on both of them. so the answer to the question "which is the nom2m3 output?" would have to be stored somewhere
    • seemed better to store the answer as the data version (and the output folder) rather than in comments, human minds, or re-investigating later lol
  2. create a different template yml file

    • doesn't seem worth creating an entirely different database schema (e.g. recipes.dcp_proximity_establishments_nom2m3) when they have the same columns/value types and are gonna be processed in the same way

in data library, if a version isn't explicitly declared when archiving, it still uses the date, so we still have that feature. seems like that's how the convention is implemented: via the default behavior of not passing a version
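
The default behavior described above can be sketched as follows. This is only an illustration: `resolve_version` is a hypothetical name, and the `YYYYMMDD` format is assumed from the convention mentioned earlier in the thread rather than taken from data library's code.

```python
from datetime import date
from typing import Optional


def resolve_version(version: Optional[str] = None,
                    today: Optional[date] = None) -> str:
    """Return the explicit version if given, else the date as YYYYMMDD.

    Hypothetical helper; `today` exists only so the default can be
    tested deterministically.
    """
    if version:
        return version
    return (today or date.today()).strftime("%Y%m%d")


# Explicit version wins; omitting it falls back to the date.
print(resolve_version("20230426_nom2m3"))
print(resolve_version())
```

The date-only convention stays the default, while an explicit version remains available for cases where the date alone does not distinguish two inputs.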

mbh329 commented 1 year ago

@damonmcc that makes sense, I think that the way you went about it was good. Don't think it would have made sense to have two different templates