NYCPlanning / db-equitable-development-tool

Data Repo for the equitable development tool (EDDT)
MIT License

Weighted Aggregations: Counts and Variances #3

Closed SashaWeinstein closed 2 years ago

SashaWeinstein commented 3 years ago

The ingestion pipeline for PUMS is working, and it's time to start aggregating the data. The first variable to calculate is Limited English Speaking Population in 2015-2019. I chose to start with this variable because it uses PUMS data and relies only on demographic info.

Calculate Count

Calculating the count requires only the person weight in the PWGTP column. Per page 9 of the PUMS documentation:

To produce estimates or tabulations of characteristics from the PUMS, add the weights of all persons or HUs that possess the characteristic of interest. For instance, if the characteristic of interest is “total number of black teachers”, simply determine the race and occupation of all persons and cumulate the weights of those who match the characteristics of interest.

This should be pretty straightforward. If I get stuck on the variance I will work more on these counts.
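As a minimal sketch of the "add the weights" rule quoted above: the records below are made up, and `limited_english` is a hypothetical flag used only for illustration, not a real PUMS column name. Only `PWGTP` comes from the PUMS data.

```python
# Made-up person records. PWGTP is the PUMS person weight;
# "limited_english" is a hypothetical flag, not an actual PUMS column.
persons = [
    {"PWGTP": 12, "limited_english": True},
    {"PWGTP": 30, "limited_english": False},
    {"PWGTP": 8, "limited_english": True},
    {"PWGTP": 25, "limited_english": False},
]


def weighted_count(records, has_characteristic, weight_col="PWGTP"):
    """Sum the weights of every record that has the characteristic."""
    return sum(r[weight_col] for r in records if has_characteristic(r))


limited_english_count = weighted_count(persons, lambda r: r["limited_english"])
print(limited_english_count)  # 20 (12 + 8)
```

The weighted count is the estimated number of people in the population with the characteristic, not the number of sampled rows.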

Calculate Variance

This is more complex, as it relies on not just one weight but 80 replicate weights in columns PWGTP1-80. These 80 replicate weights are used to find a single measure of standard error via this formula on page 13 of the PUMS documentation:

[Screenshot: standard error formula from page 13 of the PUMS documentation]
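For reference, a sketch of the replicate-weight calculation, assuming the screenshot shows the standard ACS PUMS formula SE(x) = sqrt(4/80 * sum over r of (x_r - x)^2). Here x is the estimate from PWGTP and x_r is the same estimate recomputed with replicate weight PWGTPr. The function name and the numbers are illustrative only.

```python
import math


def replicate_standard_error(estimate, replicate_estimates):
    """SE = sqrt(4/80 * sum((x_r - x)^2)) over the 80 replicate estimates.

    Assumes the standard ACS PUMS replicate-weight formula.
    """
    if len(replicate_estimates) != 80:
        raise ValueError("expected 80 replicate estimates")
    squared_diffs = sum((x_r - estimate) ** 2 for x_r in replicate_estimates)
    return math.sqrt(4 / 80 * squared_diffs)


# Made-up numbers: full-weight estimate and 80 replicate estimates.
estimate = 1000.0
replicates = [1010.0] * 40 + [990.0] * 40
se = replicate_standard_error(estimate, replicates)
print(se)  # approximately 20
```

The variance is the square of this value, so the same 80 replicate estimates cover both quantities.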

The technical documentation for the NYC Housing and Vacancy Survey (HVS) has detailed instructions on how to implement this equation (or possibly a distinct, similar equation) in SAS, Stata, and R. The Python package samplics has a replication-based estimation implementation that may use the same equation. I'm quite sure that the math is the same, so my plan is to work through an example from the HVS in samplics and see if I arrive at the same answer. If I do, I'll implement a samplics function to calculate variance for both PUMS and HVS, which would be simplest. If not, I will have to develop an implementation that calls an R function from the Python code.

Note that this assumes that the HVS and PUMS data should use the same equation for variance. I assume that they do but will need to double check.

SashaWeinstein commented 3 years ago

@AmandaDoyle I have made some progress on the replicate-weight variance calculations.

The samplics ReplicateEstimator class accepts three methods: bootstrap, brr, and jackknife. On page 66 of the NYCHVS guide to estimating variances, it says:

Although we are using SDR replicate weights, SAS assumes that we are using Balanced Repeated Replication (BRR) replicate weights. For this reason, the estimates from PROC SURVEYLOGISTIC are slightly different than the other methods.

The samplics documentation doesn't mention SDR weights. I think it's worth asking population about the difference, but for now I'm going to proceed with the assumption that samplics doesn't support the variance calculation that HVS uses.

Erica is out of town until the 15th. Should I ask Joel and cc her?

SashaWeinstein commented 3 years ago

This blog post was my guide to calculating SE for each bucket in a group-by.
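A rough sketch of what a per-bucket SE can look like (this is not the blog post's code). It assumes the standard ACS PUMS replicate formula, sums PWGTP and PWGTP1-80 within each group, and then compares each replicate total to the full-weight total. The PUMA codes, weights, and the `make_person` helper are made up for illustration.

```python
import math
from collections import defaultdict

N_REPLICATES = 80
REP_COLS = [f"PWGTP{i}" for i in range(1, N_REPLICATES + 1)]


def make_person(puma, weight, rep_shift):
    """Build a toy record whose replicate weights alternate weight +/- rep_shift."""
    rec = {"PUMA": puma, "PWGTP": weight}
    for i, col in enumerate(REP_COLS):
        rec[col] = weight + rep_shift if i % 2 == 0 else weight - rep_shift
    return rec


# Made-up records; a real run would read these from the PUMS person file.
persons = [
    make_person("3701", 10, 1),
    make_person("3701", 20, 2),
    make_person("3702", 15, 3),
]


def count_and_se_by_group(records, group_col):
    """Return {group: (weighted count, standard error)}."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        bucket = totals[rec[group_col]]
        for col in ["PWGTP"] + REP_COLS:
            bucket[col] += rec[col]
    results = {}
    for key, sums in totals.items():
        estimate = sums["PWGTP"]
        squared_diffs = sum((sums[col] - estimate) ** 2 for col in REP_COLS)
        results[key] = (estimate, math.sqrt(4 / N_REPLICATES * squared_diffs))
    return results


by_puma = count_and_se_by_group(persons, "PUMA")
for puma, (count, se) in sorted(by_puma.items()):
    print(puma, count, round(se, 2))
```

The replicate totals are summed within each bucket before differencing, because the formula applies to the group-level estimate rather than to individual rows.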

SashaWeinstein commented 3 years ago

Next three steps are:
1) Calculate medians based on the directions in the NYCHVS guide to calculating variances.
2) Read about how to call R code from a Python module. If it's easy to work with, it may be better than samplics even if samplics can do the calculations we want. Having a common language with HVS to interpret their technical documentation and work with them directly might be important as well.
3) Find or create a mapping between HVS sub-borough areas and PUMAs. Good one-off task for when I have a short block of time.
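For the median step, the NYCHVS guide's exact procedure is not reproduced here. As a hedged sketch of one common definition, a weighted median can be taken as the smallest value at which the cumulative weight reaches half of the total weight. The function name and data below are illustrative assumptions.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight.

    One common definition; not necessarily the NYCHVS guide's method.
    """
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("no data")


# Made-up incomes and survey weights.
incomes = [20000, 35000, 50000, 80000]
weights = [10, 30, 5, 15]
median_income = weighted_median(incomes, weights)
print(median_income)  # 35000
```

Under the replicate-weight approach, the same function would be rerun with each replicate weight column to get the 80 replicate medians that feed the standard error formula.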

SashaWeinstein commented 3 years ago

I found a Python package, rpy2, that allows calling R functions from a Python script. I recreated the calculation of the total number of occupied units with SE based on the NYCHVS guide to calculating variances. I think that this is a better approach than samplics because there is better documentation online for googling through problems. I have three to-dos right now:

SashaWeinstein commented 2 years ago

The approach to calculating variances was finalized as rpy2.