NYCPlanning / db-equitable-development-tool

Data Repo for the equitable development tool (EDDT)
MIT License

source data version control #294

Closed td928 closed 2 years ago

td928 commented 2 years ago

293 multiple reviewers required 🐬 🐬

Overview

This stemmed from SOE's request to hold the source data constant from the previous update while also adding new fields to the output, which raises the question: do we really know which versions of source data we are using in our output? Max and I first figured out which datasets and outputs are impacted by the open data sources we pull. That led me on this mini-expedition to figure out how to keep track of the versions of the few open data sources (for now) in the pipeline. Then I remembered @SashaWeinstein already did this for the facility database, so I went over there and grabbed a lot of the code off the shelf.

ingest_helpers.py and metadata.py

This work does span a few files, so it might require some patience to review each one. But the thread tying everything together is the new code in ingest_helpers.py and metadata.py. metadata.py is almost verbatim from the facility database. I do wonder if we should change it a little to dump metadata.yml somewhere else, but that should be a relatively easy change to make.

ingest_helpers.py was slimmed down a little bit to do just the job. I was debating whether fetching the version from the config file is useful for this task and decided we could do without it. So now it takes the version from the datasets.yml file, checks whether that version is downloaded in ingest/data_library/dataloading.sh, and then reads in the file.
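For reviewers, the flow described here (take the pinned version, confirm it is available, then read) might look roughly like the sketch below. This is not the code in ingest_helpers.py: the function name `resolve_local_file`, the `name/version/name.csv` folder layout, and the plain dict standing in for the parsed datasets.yml are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict


def resolve_local_file(name: str, datasets: Dict[str, str], data_dir: Path) -> Path:
    """Return the local path for the pinned version of a dataset.

    `datasets` is a hypothetical stand-in for the parsed datasets.yml,
    mapping dataset name -> pinned version string.
    The <data_dir>/<name>/<version>/<name>.csv layout is an assumption.
    """
    if name not in datasets:
        raise KeyError(f"{name} is not listed in the datasets config")

    version = datasets[name]
    path = data_dir / name / version / f"{name}.csv"

    # Fail loudly rather than silently falling back to another version.
    if not path.is_file():
        raise FileNotFoundError(
            f"{name} version {version} has not been downloaded, expected {path}"
        )
    return path
```

Raising on a missing file, instead of picking up whatever version happens to be on disk, is what keeps the output tied to the version written in the config.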

datasets and indicator scripts

There are currently only four datasets coming directly from open data. They are all listed in the datasets.yml file. Each of the indicators impacted by those source data was updated to use the read_from_s3 functions. Some additional tinkering was required, e.g. for historical district, where a pandas dataframe is ingested first and then converted to geopandas.

export metadata

This uses the existing infrastructure for exporting. I modified both the workflow file export.yml and export_DO.sh to accommodate exporting metadata.yml. I am depositing this file directly in the root of the db-eddt folder on DO.
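To make the idea of a versions file concrete, here is a minimal sketch of rendering dataset versions into YAML text. It is not the actual metadata.py: the function name `render_metadata` and the `datasets: / name: / version:` layout are assumptions, and the YAML is written by hand with the standard library only (JSON-quoted strings are valid YAML scalars).

```python
import json
from typing import Dict, List


def render_metadata(versions: Dict[str, str]) -> str:
    """Render dataset versions as a small YAML document.

    `versions` maps dataset name -> version string. The key layout
    below is hypothetical, not the format metadata.py produces.
    """
    lines: List[str] = ["datasets:"]
    # Sort names so the file is stable between runs and diffs cleanly.
    for name in sorted(versions):
        lines.append(f"  {json.dumps(name)}:")
        lines.append(f"    version: {json.dumps(versions[name])}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical dataset name and version, for illustration only.
    print(render_metadata({"example_dataset": "20220101"}), end="")
```

Quoting the version matters: an unquoted value such as 20220101 would be read back as an integer by a YAML parser rather than as a version string.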

SashaWeinstein commented 2 years ago

This pipeline is bad what idiot designed this lol

SashaWeinstein commented 2 years ago

Also I just realized that most files aren't formatted with black. We should address this in this PR or a separate one. Whenever you're ready to think about it

td928 commented 2 years ago

Cast lat/long columns to appropriate datatype

@mbh329 let me know if the latest commit fixed this issue for you. Thanks!

td928 commented 2 years ago

also fixed for residential eviction @mbh329

mbh329 commented 2 years ago

The latest commit works for housing_production at the puma level, but there is still an issue with housing_security. It seems to be an issue with evictions.

mbh329 commented 2 years ago

Fixed, the housing_security and housing_production at the puma level work as expected @td928

mbh329 commented 2 years ago

That is fine with me, definitely want to make sure this is right

mbh329 commented 2 years ago

Would it be helpful to go over this together with a little code review? @td928 @SashaWeinstein

SashaWeinstein commented 2 years ago

The next step for me is to run the new code, can get on a call later in the day

SashaWeinstein commented 2 years ago

Lol Te why did you edit my comment instead of responding

td928 commented 2 years ago

> Lol Te why did you edit my comment instead of responding

some next level move by me. I guess now we can merge this whole thing in? @SashaWeinstein

SashaWeinstein commented 2 years ago

Doing my final review now

SashaWeinstein commented 2 years ago

The refactor to move the hardcoding of the columns to separate functions looks good to me. The only thing is that I think the convention is to return List[str] instead of list https://stackoverflow.com/questions/52623204/how-to-specify-method-return-type-list-of-what-in-python
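As a small illustration of the annotation being suggested, a column-list helper might be typed like this. The function name and column names are hypothetical, not the ones in this PR.

```python
from typing import List


def example_columns() -> List[str]:
    """Return the hardcoded column names for one indicator.

    Annotating with List[str] rather than bare list tells readers
    and type checkers what the elements are.
    """
    # Hypothetical column names, for illustration only.
    return ["latitude", "longitude"]
```

On Python 3.9 and later the built-in generic `list[str]` works the same way without the `typing` import.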