Open MaxGhenis opened 3 years ago
@MaxGhenis Yes, psid_lifetime_income.pkl is too big for GH (~124 MB).
I haven't run that script on Colab, but it runs fine locally (assuming you have all dependencies installed).
All columns in Issue #6 are already included in the PSID data saved to the repo.
Recapping next steps from a meeting with @prrathi:
- head_age, spouse_age, and num_children_under18 are exported from psid_download.R (see comments in #6 on why these are the fields needed)
- psid_lifetime_income.pkl also preserves these variables; if not, may need to add to constant_vars
- household_structure.py, which (a) calculates nu18, n1864, and n65 from these variables for each record in psid_lifetime_income (per #6), (b) calculates the average of each of these by s,j, and (c) applies the MVKDE function to smooth these cells (see #25).

The KDE functions and the dependent scipy.stats.gaussian_kde require probability data. I think we have two options:
- s by j by nu18 (or n1864 or n65, separately). scipy.stats.gaussian_kde accepts multivariate (not just bivariate) data, so this should work, and then we can compute the average in each s x j cell using the density estimates.

@jdebacker what would you suggest?
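To make the multivariate option concrete, here is a rough sketch on synthetic data. It assumes s is age and j is a lifetime-income group index; the data, the `cell_mean` helper, and the grid of child counts are all made up for illustration and are not from the repo. It fits a trivariate `gaussian_kde` over (s, j, nu18) and reads off the average in a cell as the density-weighted mean of nu18 at that (s, j).

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n_obs = 2000

# Synthetic stand-in for PSID records (assumption, not real data).
s = rng.integers(20, 80, n_obs)                 # age
j = rng.integers(0, 7, n_obs)                   # lifetime-income group
lam = 2.0 * np.exp(-((s - 40) / 12.0) ** 2)     # kids peak around age 40
nu18 = rng.poisson(lam)

# gaussian_kde expects an array of shape (n_dims, n_points).
data = np.vstack([s, j, nu18]).astype(float)
kde = gaussian_kde(data)

N_GRID = np.arange(0, 9)  # candidate child counts to evaluate (hypothetical cap)


def cell_mean(kde, s_val, j_val):
    """Average nu18 in cell (s_val, j_val) implied by the joint density."""
    pts = np.vstack([
        np.full(N_GRID.shape, s_val),
        np.full(N_GRID.shape, j_val),
        N_GRID,
    ]).astype(float)
    dens = kde(pts)
    # Conditional mean: sum_n n * f(s, j, n) / sum_n f(s, j, n)
    return float(np.sum(N_GRID * dens) / np.sum(dens))


mean_at_40 = cell_mean(kde, 40, 3)
mean_at_70 = cell_mean(kde, 70, 3)
```

The same pattern would be repeated for n1864 and n65 separately. The default bandwidth is used here without any tuning.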
Actually @prrathi and I realized that we could use the existing KDE function by modeling each s x j cell's share of total children/adults/seniors, in the same way that, e.g., the share of total transfers by s x j is modeled. Then we can multiply that share by the current number of children/adults/seniors to get the average by s x j.
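A rough sketch of the share idea, for reference. This does not call the repo's KDE function; it swaps in `scipy.stats.gaussian_kde` with `weights` as a stand-in, and the data, the continuous income percentile used for j, and the grid are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
n_fam = 3000

# Synthetic families (assumption): age of head, income percentile, kids.
age = rng.uniform(20, 80, n_fam)
inc_pct = rng.uniform(0, 100, n_fam)
kids = rng.poisson(2.0 * np.exp(-((age - 40) / 12.0) ** 2))

# Weight each family by its number of kids, so the estimated density
# describes where children are located in (s, j) space.
has_kids = kids > 0
kde = gaussian_kde(
    np.vstack([age[has_kids], inc_pct[has_kids]]),
    weights=kids[has_kids],
)

# Evaluate on an s x j grid (hypothetical: 60 ages, 7 income midpoints).
s_grid = np.arange(20, 80)
j_grid = np.linspace(5, 95, 7)
S, J = np.meshgrid(s_grid, j_grid, indexing="ij")
dens = kde(np.vstack([S.ravel(), J.ravel()])).reshape(S.shape)

share = dens / dens.sum()           # each cell's share of all children
kids_by_cell = share * kids.sum()   # smoothed number of children per cell
```

Note that `kids_by_cell` is a smoothed count per cell; getting a per-family average out of it still needs the number of families in each cell as a denominator.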
Yes - that is a good solution!
Some updates:
@prrathi tried the KDE with some PSID data, but it was still noisy because it's the quotient of a smoothed numerator (# kids in bin) and unsmoothed denominator (# families in bin). He's going to try smoothing the denominator too.
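To illustrate why the quotient is noisy, here is a one-dimensional sketch (age only, synthetic data, `gaussian_kde` as a stand-in for the repo's KDE function; all names are hypothetical). It compares a smoothed numerator divided by raw family counts against the same numerator divided by a smoothed denominator.

```python
import numpy as np
from scipy.stats import gaussian_kde


def smoothed_mass(x, weights, grid):
    """Weighted 1-D KDE on `grid`, rescaled so cells sum to total weight."""
    kde = gaussian_kde(x, weights=weights)
    dens = kde(grid)
    return dens / dens.sum() * weights.sum()


rng = np.random.default_rng(2)
age = rng.uniform(20, 80, 1500)
kids = rng.poisson(2.0 * np.exp(-((age - 40) / 12.0) ** 2)).astype(float)

grid = np.arange(20, 80) + 0.5                        # midpoints of 1-year bins
raw_families, _ = np.histogram(age, bins=np.arange(20, 81))

num = smoothed_mass(age, kids, grid)                  # smoothed # kids in bin
den_smooth = smoothed_mass(age, np.ones_like(age), grid)  # smoothed # families

avg_noisy = num / np.maximum(raw_families, 1)         # unsmoothed denominator
avg_smooth = num / den_smooth                         # both sides smoothed


def roughness(y):
    """Sum of absolute second differences; larger means more jagged."""
    return float(np.abs(np.diff(y, n=2)).sum())


rough_noisy = roughness(avg_noisy)
rough_smooth = roughness(avg_smooth)
```

In this toy setup the sampling noise in the raw bin counts passes straight through to the ratio, which is the behavior described above.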
Given the PSID data issues described in #28, we tried returning to the taxdata CPS file in this notebook, and using stratified LOESS. Here's the raw data for 18-64:

To avoid the jumps, @prrathi is going to start with the counts excluding the household head, then add the household head to the appropriate count based on their age post hoc.
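The post hoc step could look roughly like this. It is only a sketch: the function name and inputs are hypothetical, and the inputs are assumed to be already-smoothed averages by head age that exclude the head.

```python
import numpy as np


def add_head_post_hoc(head_age, n_u18_excl, n_1864_excl, n_65_excl):
    """Add the household head back into the age bin matching their age.

    The *_excl inputs are averages per unit that exclude the head,
    indexed by head age (assumption for this sketch).
    """
    head_age = np.asarray(head_age)
    n_u18 = np.asarray(n_u18_excl, float) + (head_age < 18)
    n_1864 = np.asarray(n_1864_excl, float) + ((head_age >= 18) & (head_age < 65))
    n_65 = np.asarray(n_65_excl, float) + (head_age >= 65)
    return n_u18, n_1864, n_65


# Made-up smoothed values around the 64/65 boundary.
head_age = np.array([30, 64, 65, 80])
n_u18, n_1864, n_65 = add_head_post_hoc(
    head_age,
    n_u18_excl=[1.2, 0.1, 0.1, 0.0],
    n_1864_excl=[0.8, 0.5, 0.4, 0.1],
    n_65_excl=[0.0, 0.3, 0.4, 0.7],
)
```

The jump of one person at head age 65 is then introduced exactly, after smoothing, instead of being something the smoother has to fit.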
Here's the LOESS smoother with the 18-64 bin, just for household head ages 18-64 to avoid smoothing that spike:
And the residuals:
We tried some different values of frac (essentially bandwidth, defaults to 0.67), and found that 0.4 avoided large sustained residuals while also avoiding too many inflection points, which seem implausible.
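For anyone wanting to see what frac does, here is a small hand-written LOESS (local linear fit with tricube weights, no robustness iterations) on made-up data. It is not the smoother used in the notebook; the 0.67 default quoted above matches the `frac` default of statsmodels' `lowess`, but which implementation the notebook uses is an assumption on my part. In a stratified version this would be run once per stratum.

```python
import numpy as np


def loess(x, y, frac=2 / 3):
    """Local linear regression with tricube weights, evaluated at each x."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))        # points in each local window
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = max(np.sort(d)[k - 1], 1e-12)     # distance to k-th nearest point
        w = np.clip(1 - (d / h) ** 3, 0, 1) ** 3   # tricube weights
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(n), x - x[i]])
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]                   # intercept = fit at x[i]
    return fitted


rng = np.random.default_rng(3)
ages = np.arange(18, 65).astype(float)        # household head ages 18-64
signal = 1.5 * np.exp(-((ages - 40) / 8.0) ** 2)
y = signal + rng.normal(0, 0.05, ages.size)   # synthetic averages

resid_default = y - loess(ages, y, frac=2 / 3)
resid_04 = y - loess(ages, y, frac=0.4)
rss_default = float(np.sum(resid_default ** 2))
rss_04 = float(np.sum(resid_04 ** 2))
```

A smaller frac means a narrower window, so the fit follows the data more closely (smaller sustained residuals) at the cost of more wiggles.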
If the KDE smoothing for the numerator and denominator doesn't work as well, this stratified LOESS seems pretty good (though a multivariate LOESS would be better). @rickecon fyi.
Implementing UBI directly in OG-USA (https://github.com/PSLmodels/OG-USA/issues/626) requires calibrating the number of people per tax unit by s,j, split for each of the age groups that could have different UBI amounts, currently 0-17, 18-64, and 65+. We'll want to calculate the value per s,j and then apply kernel density smoothing.

@prrathi and I calculated unsmoothed values using CPS tax units in this notebook. Next step is to do it with PSID instead.
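For reference, the unsmoothed calculation could be sketched like this in pandas. The toy data and column names (`unit_id`, `is_head`, `weight`) are hypothetical, and it assumes s is the head's age and j is an income-group index; it is not the notebook's code.

```python
import pandas as pd

# Toy person-level records (assumption).
people = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "age":     [40, 38, 10, 70, 68, 30, 40, 44, 16, 12],
    "is_head": [True, False, False, True, False, True,
                True, False, False, False],
})
# Toy unit-level records with income group j and sampling weight.
units = pd.DataFrame({
    "unit_id": [1, 2, 3, 4],
    "j":       [0, 1, 0, 0],
    "weight":  [100.0, 200.0, 150.0, 300.0],
})

# Flag each person's age group, then count per tax unit.
people["nu18"] = people["age"] < 18
people["n1864"] = people["age"].between(18, 64)
people["n65"] = people["age"] >= 65
cols = ["nu18", "n1864", "n65"]
counts = people.groupby("unit_id")[cols].sum().reset_index()

# s is taken to be the head's age.
heads = people.loc[people["is_head"], ["unit_id", "age"]]
heads = heads.rename(columns={"age": "s"})
units = units.merge(heads, on="unit_id").merge(counts, on="unit_id")

# Weighted average of each count within every (s, j) cell.
weighted = units[cols].multiply(units["weight"], axis=0)
weighted[["s", "j", "weight"]] = units[["s", "j", "weight"]]
sums = weighted.groupby(["s", "j"]).sum()
avg = sums[cols].div(sums["weight"], axis=0)
```

`avg` is indexed by (s, j) with one column per age group, which is the shape a smoother would then take as input.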
Seems like we can use psid_data_setup.py for this. Our first try crashed Colab but @prrathi will try it again.

@jdebacker, is psid_lifetime_income.pkl, produced in that script, too big for GitHub? Or will we have to hold onto the columns listed in https://github.com/PSLmodels/OG-USA-Calibration/issues/6 and aggregate them along the way anyway, requiring modification to psid_data_setup.py?