Closed SashaWeinstein closed 2 years ago
@AmandaDoyle I have made some progress on the replicate-weight variance calculations.
The samplics ReplicateEstimator class accepts three methods: bootstrap, brr, and jackknife.
On page 66 of the NYCHVS guide to estimating variances, it says
Although we are using SDR replicate weights, SAS assumes that we are using Balanced Repeated Replication (BRR) replicate weights. For this reason, the estimates from PROC SURVEYLOGISTIC are slightly different than the other methods.
The samplics documentation doesn't mention SDR weights. I think it's worth asking population about the difference, but for now I'm going to proceed on the assumption that samplics doesn't support the variance calculation that HVS uses.
Erica is out of town until the 15th. Should I ask Joel and cc her?
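On the SDR versus BRR question, a small sketch may help frame what to ask. It assumes the usual textbook forms: BRR variance is the mean squared deviation of the replicate estimates from the full-sample estimate, Fay's variant divides that by (1 - k)^2, and SDR as used by the Census Bureau multiplies the sum of squared deviations by 4/R. The function names are hypothetical, and whether samplics exposes a Fay coefficient that would reproduce the SDR factor still needs to be checked against its documentation.

```python
def brr_variance(full_estimate, replicate_estimates, fay_coefficient=0.0):
    """BRR variance, optionally with Fay's adjustment (assumed textbook form)."""
    n_reps = len(replicate_estimates)
    squared_devs = sum((rep - full_estimate) ** 2 for rep in replicate_estimates)
    # Plain BRR uses 1/R; Fay's method rescales by 1/(1 - k)^2.
    return squared_devs / (n_reps * (1.0 - fay_coefficient) ** 2)


def sdr_variance(full_estimate, replicate_estimates):
    """SDR variance with the 4/R factor (assumed to match the HVS and PUMS guides)."""
    n_reps = len(replicate_estimates)
    squared_devs = sum((rep - full_estimate) ** 2 for rep in replicate_estimates)
    return 4.0 / n_reps * squared_devs
```

Under these assumptions the SDR variance is four times the plain BRR variance, and it coincides with Fay's BRR at a coefficient of 0.5. The standard errors would therefore differ by a factor of two if a tool treated SDR weights as plain BRR weights.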
This blog post was my guide to calculating the SE for each bucket in a group by.
The next three steps are:
1) Calculate medians based on the directions in the NYCHVS guide to calculating variances.
2) Read about how to call R code from a Python module. If it's easy to work with, it may be better than samplics even if samplics can do the calculations we want. Having a common language with HVS to interpret their technical documentation and work with them directly might be important as well.
3) Find or create a mapping between HVS sub-borough areas and PUMAs. This is a good one-off task for when I have a short block of time.
I found a Python package, rpy2, that allows for calling R functions from a Python script. I recreated the calculation of the total number of occupied units with its SE, based on the NYCHVS guide to calculating variances. I think this is a better approach than samplics because there is better documentation online for googling through problems. I have three to-dos right now:
The approach to calculating variances was finalized as rpy2.
The ingestion pipeline for PUMS is working and it's time to start aggregating the data. The first variable to calculate is Limited English Speaking Population in 2015-2019. I chose to start with this variable because it uses PUMS data and relies only on demographic info.
Calculate Count
Calculating the count requires only the person weight in the PWGTP column. Per page 9 of the PUMS documentation
This should be pretty straightforward. If I get stuck on the variance I will work more on these counts.
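A minimal sketch of the count, assuming each person record is a dict that carries the PWGTP weight. The `is_lep` flag is a hypothetical, precomputed column standing in for whatever rule ends up defining limited English speakers, and `weighted_count` is an illustrative name rather than anything in the pipeline.

```python
def weighted_count(records, predicate, weight_col="PWGTP"):
    """Estimate a population count by summing person weights over matching records."""
    return sum(rec[weight_col] for rec in records if predicate(rec))


# Toy records: each survey respondent represents PWGTP people in the population.
people = [
    {"PWGTP": 10, "is_lep": True},
    {"PWGTP": 20, "is_lep": False},
    {"PWGTP": 15, "is_lep": True},
]

lep_count = weighted_count(people, lambda rec: rec["is_lep"])
```

With the PUMS data in a pandas DataFrame, the same idea would be a filtered sum over the PWGTP column.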
Calculate Variance
This is more complex as it relies on not just one weight but 80 replicate weights in columns PWGTP1-80. These 80 replicate weights are used to find a single measure of standard error via this formula on page 13 of the PUMS documentation
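A minimal sketch of that calculation, assuming the formula is the standard ACS replicate-weight one: SE = sqrt((4/80) * sum over r of (X_r - X)^2), where X is the estimate from PWGTP and X_r is the same estimate recomputed with replicate weight PWGTPr. The record layout and the function name are hypothetical, and the formula should be confirmed against page 13 before relying on this.

```python
import math


def count_with_standard_error(records, predicate, n_replicates=80, weight_prefix="PWGTP"):
    """Return (estimate, standard error) for a weighted count using replicate weights."""
    matching = [rec for rec in records if predicate(rec)]

    # Full-sample estimate from the main person weight.
    estimate = sum(rec[weight_prefix] for rec in matching)

    # Recompute the estimate once per replicate weight and accumulate squared deviations.
    squared_devs = 0.0
    for r in range(1, n_replicates + 1):
        replicate_estimate = sum(rec[f"{weight_prefix}{r}"] for rec in matching)
        squared_devs += (replicate_estimate - estimate) ** 2

    standard_error = math.sqrt(4.0 / n_replicates * squared_devs)
    return estimate, standard_error
```

The same loop applies to any statistic, not just counts: compute it 81 times (once per weight column) and feed the 80 replicate results into the formula.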
The technical documentation for the NYC Housing and Vacancy Survey has detailed instructions on how to implement this equation (or possibly a distinct, similar equation) in SAS, Stata, and R. The Python package samplics has a replicate-based estimation implementation that may use the same equation. I'm quite sure that the math is exactly the same, so my plan is to go through an example from the HVS in samplics and see if I arrive at the same answer. If I do, then I'll implement a samplics function to calculate variance for both PUMS and HVS, which would be simplest. If not, an implementation that calls an R function from the Python code will have to be developed.
Note that this assumes that the HVS and PUMS data should use the same equation for variance. I assume that they do but will need to double check.