NYCPlanning / db-equitable-development-tool

Data Repo for the equitable development tool (EDDT)
MIT License

176 linear interpolation medians #184

Closed SashaWeinstein closed 2 years ago

SashaWeinstein commented 2 years ago

Calculating medians the correct way

All work in this pull request is based on population factfinder's (pff) median calculations, which you can read about here. I tried to follow that code as closely as I could, but there are some changes. The biggest is that I work off a dataframe that is arranged differently from the row used in pff. There are also small syntax differences throughout the code: I wanted the functions to be easy to compare, but I also wanted to use patterns I'm familiar with.

Unlike the pff code, I don't elegantly handle any corner cases. Fortunately, none of the PUMAs hit those corner cases for the age variable. For household income and wages we may hit more, but we will cross that bridge when we come to it.
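For readers unfamiliar with the technique, here is a minimal sketch of a linear-interpolation median over binned, weighted counts. It is not the code in this PR or in pff: the function name `interpolated_median`, the `(lower, upper)` bin layout, and the example numbers are all assumptions made for illustration, and it skips the corner cases mentioned above (for example a median falling in an open-ended top bin).

```python
from typing import Sequence, Tuple


def interpolated_median(
    bins: Sequence[Tuple[float, float]], counts: Sequence[float]
) -> float:
    """Estimate a median from binned counts by linear interpolation.

    bins   -- (lower, upper) edges per bin, in ascending order (hypothetical layout)
    counts -- weighted count of records in each bin, same order as bins
    """
    if len(bins) != len(counts):
        raise ValueError("bins and counts must have the same length")
    total = sum(counts)
    if total <= 0:
        raise ValueError("total count must be positive")

    half = total / 2
    cumulative = 0.0
    for (lower, upper), count in zip(bins, counts):
        # The median bin is the first non-empty bin whose cumulative
        # count reaches half of the total.
        if count > 0 and cumulative + count >= half:
            # Assume records are spread evenly inside the bin.
            fraction = (half - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    raise RuntimeError("no median bin found")


# Made-up example: 40 weighted people in three 5-year age bins.
example = interpolated_median([(0, 5), (5, 10), (10, 15)], [10, 20, 10])
```

In the made-up example, half of the total is 20, which lands halfway through the second bin, so the estimate is the midpoint of that bin.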

Integrating new functions into aggregation pipeline

I only call the new function from demographic medians as 1) it's the simplest (best place to start) and 2) demographics is the category we are trying to get out tomorrow.

Internal Review

There is one internal review file with only the 5 test PUMAs. Calculations using this method are much quicker than the survey package. I checked all 5 of the test PUMAs against the summary files and they fell within the margin of error.

I couldn't get all PUMAs because I locked myself out of the census API by making too many requests. There was an error in the while loop, which this PR addresses. Hopefully tomorrow we can run the full dataset of PUMAs.

Metadata

I get the bins and design factor from a JSON file, which is how pff does it. I copied the JSON straight from that repo, as its bins match the bins in the data matrix.
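As a rough illustration of reading bins and a design factor from JSON, here is a small sketch. The JSON layout, the key names (`design_factor`, `bins`), the function `load_median_metadata`, and every value shown are placeholders invented for this example; the actual file copied from pff may be structured differently.

```python
import json
from typing import List, Tuple

# Placeholder metadata: layout and values are hypothetical, not pff's real file.
EXAMPLE_METADATA = """
{
  "age": {
    "design_factor": 1.1,
    "bins": [[0, 5], [5, 10], [10, 15]]
  }
}
"""


def load_median_metadata(
    raw: str, variable: str
) -> Tuple[List[Tuple[float, float]], float]:
    """Return (bins, design_factor) for one variable from a JSON string."""
    entry = json.loads(raw)[variable]
    # JSON arrays come back as lists; convert each bin to a (lower, upper) tuple.
    bins = [(float(lower), float(upper)) for lower, upper in entry["bins"]]
    return bins, float(entry["design_factor"])


bins, design_factor = load_median_metadata(EXAMPLE_METADATA, "age")
```

Keeping the bins in metadata rather than in code means the same interpolation function can serve age, household income, and wages by swapping the variable key.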

Miscellaneous changes

The way order_columns is passed to the base aggregator's init is jank. I opened an issue proposing an enhancement with a better design.

Another change is that medians no longer download PUMS with replicate weights. I didn't get a chance to test this code on my machine, as the API isn't working for me today, but hopefully it will work on someone else's machine.

PUMS GET

Fixed this: a counter now ticks up so that each attempt makes at most 5 requests.
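The general pattern for a bounded request loop looks something like the sketch below. It is not the repo's PUMS GET code: `get_with_retries` and the `fetch` callable are hypothetical names, and treating a `None` return as a failed request is an assumption made to keep the example self-contained.

```python
from typing import Callable, Optional


def get_with_retries(
    fetch: Callable[[], Optional[dict]], max_requests: int = 5
) -> dict:
    """Call fetch() until it returns data, making at most max_requests calls."""
    requests_made = 0
    while requests_made < max_requests:
        # Increment before calling so a failure can never loop forever.
        requests_made += 1
        result = fetch()
        if result is not None:
            return result
    raise RuntimeError(f"gave up after {max_requests} requests")


# Usage with a stand-in fetch that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch() -> Optional[dict]:
    calls["n"] += 1
    return {"ok": True} if calls["n"] >= 3 else None


data = get_with_retries(flaky_fetch)
```

The key point is that the counter advances on every pass through the loop, so the loop condition eventually fails even if the API never responds successfully.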

SashaWeinstein commented 2 years ago

Yes Te, that's correct: replicate weights aren't used in linear interpolation. The code should be changed so that the PUMS data used by the median PUMS demographics base aggregator doesn't include replicate weights.
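One way to picture that change is the sketch below. PUMS names its replicate weights PWGTP1 through PWGTP80 for persons and WGTP1 through WGTP80 for households, alongside the main weights PWGTP and WGTP. The helper `drop_replicate_weights` is a hypothetical name, and the repo may select its columns in a different way.

```python
import re
from typing import Iterable, List

# Matches PWGTP1..PWGTP80 and WGTP1..WGTP80, but not the main PWGTP / WGTP weights.
REPLICATE_WEIGHT = re.compile(r"^P?WGTP\d+$")


def drop_replicate_weights(columns: Iterable[str]) -> List[str]:
    """Return the column names with replicate-weight columns removed."""
    return [name for name in columns if not REPLICATE_WEIGHT.match(name)]


requested = ["PUMA", "AGEP", "PWGTP", "PWGTP1", "PWGTP80"]
needed_for_median = drop_replicate_weights(requested)
```

Only the main weight is needed to build the weighted bin counts, so skipping the 80 replicate columns makes each request much smaller.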