PSLmodels / OG-USA

Overlapping-generations macroeconomic model for evaluating fiscal policy in the United States
https://pslmodels.github.io/OG-USA/
Creative Commons Zero v1.0 Universal

Calibrate number of people age {0-17, 18-64, 65+} per tax unit by s,j #9

Open MaxGhenis opened 3 years ago

MaxGhenis commented 3 years ago

Implementing UBI directly in OG-USA (https://github.com/PSLmodels/OG-USA/issues/626) requires calibrating the number of people per tax unit by s,j, split for each of the age groups that could have different UBI amounts, currently 0-17, 18-64, and 65+. We'll want to calculate the value per s,j and then apply kernel density smoothing.

@prrathi and I calculated unsmoothed values using CPS tax units in this notebook. Next step is to do it with PSID instead.

Seems like we can use psid_data_setup.py for this. Our first try crashed Colab but @prrathi will try it again.

@jdebacker, is psid_lifetime_income.pkl, produced in that script, too big for GitHub?

Or will we have to hold onto the columns listed in https://github.com/PSLmodels/OG-USA-Calibration/issues/6 and aggregate them along the way anyway, requiring modification to psid_data_setup.py?

jdebacker commented 3 years ago

@MaxGhenis Yes, psid_lifetime_income.pkl is too big for GH (~124 MB).

I haven't run that script on Colab, but it runs fine locally (assuming you have all dependencies installed).

All columns in Issue #6 are already included in the PSID data saved to the repo.

MaxGhenis commented 3 years ago

Recapping next steps from a meeting with @prrathi:

  1. Verify that head_age, spouse_age, and num_children_under18 are exported from psid_download.R (see comments in #6 on why these are the fields needed)
  2. Verify that psid_lifetime_income.pkl also preserves these variables; if not, may need to add to constant_vars
  3. Create a new file, e.g. household_structure.py, which (a) calculates nu18, n1864, and n65 from these variables for each record in psid_lifetime_income (per #6), (b) calculates the average of each of these by s,j, and (c) applies the MVKDE function to smooth these cells (see #25).
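
Steps (a) and (b) could be sketched roughly as below. This is only an illustration: it assumes each record counts just the head, spouse, and children under 18, that a missing spouse_age means no spouse, and that the age and lifetime-income-group columns are called `s` and `j` (hypothetical names). It also takes unweighted means, which is a simplification.

```python
import pandas as pd


def count_age_groups(df):
    """Add nu18, n1864 and n65 columns from head/spouse ages and child count."""
    out = df.copy()
    # Ages of the head and spouse; NaN comparisons are False, so a missing
    # spouse simply adds nobody to any group (assumption).
    adults = out[["head_age", "spouse_age"]]
    out["nu18"] = out["num_children_under18"] + (adults < 18).sum(axis=1)
    out["n1864"] = ((adults >= 18) & (adults <= 64)).sum(axis=1)
    out["n65"] = (adults >= 65).sum(axis=1)
    return out


def average_by_cell(df, s_col="s", j_col="j"):
    """Unweighted mean of each count within every s,j cell."""
    return df.groupby([s_col, j_col])[["nu18", "n1864", "n65"]].mean()
```

The result of `average_by_cell` is the unsmoothed table that step (c) would then smooth.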

MaxGhenis commented 3 years ago

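The KDE functions, and the scipy.stats.gaussian_kde they depend on, estimate a probability density from sample points, so they can't directly smooth a cell average like people per tax unit. I think we have two options: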

  1. Smooth with something like LOESS, though I couldn't find a multivariate LOESS smoother in Python
  2. Apply KDE using an extra dimension of the number of people, e.g. determining cells in s by j by nu18 (or n1864 or n65, separately). scipy.stats.gaussian_kde accepts multivariate (not just bivariate) data, so this should work, and then we can compute the average in each s x j cell using the density estimates.
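
A rough sketch of option 2, for illustration only. It fits a three-dimensional gaussian_kde over (s, j, count) and then takes the density-weighted mean of the count within each s x j cell. The function name and the grids are hypothetical, j is treated as continuous, and any density mass outside `n_grid` (e.g. below zero) is ignored, so treat it as an approximation.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_cell_means(s, j, n, s_grid, j_grid, n_grid):
    """Approximate E[n | s, j] on a grid from a 3-D kernel density estimate."""
    # gaussian_kde expects the data as an array of shape (n_dims, n_obs).
    kde = gaussian_kde(np.vstack([s, j, n]))
    S, J, N = np.meshgrid(s_grid, j_grid, n_grid, indexing="ij")
    points = np.vstack([S.ravel(), J.ravel(), N.ravel()])
    dens = kde(points).reshape(S.shape)
    # Normalize along the count axis so each s x j cell has weights summing to 1.
    weights = dens / dens.sum(axis=2, keepdims=True)
    return (weights * N).sum(axis=2)
```

The same function would be called separately for nu18, n1864, and n65.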

@jdebacker what would you suggest?

MaxGhenis commented 3 years ago

Actually, @prrathi and I realized that we could use the existing KDE function where we model each sxj's share of total children/adults/seniors in the same way that e.g. the share of total transfers by sxj is modeled. Then we can multiply that by the current total number of children/adults/seniors and divide by the number of tax units in each sxj cell to get the average by sxj.

jdebacker commented 3 years ago

Yes - that is a good solution!

MaxGhenis commented 3 years ago

Some updates:

@prrathi tried the KDE with some PSID data, but it was still noisy because it's the quotient of a smoothed numerator (# kids in bin) and unsmoothed denominator (# families in bin). He's going to try smoothing the denominator too.
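
For reference, the quotient in question can be written out as below. This is a minimal sketch with hypothetical names: it takes the two share matrices as given (however they were smoothed) and does not call the existing KDE function.

```python
import numpy as np


def average_from_shares(person_share, unit_share, total_persons, total_units):
    """Average people per tax unit in each s x j cell from two share matrices."""
    # Numerator: people in each cell (share of the total times the total).
    persons = np.asarray(person_share, dtype=float) * total_persons
    # Denominator: tax units (families) in each cell, built the same way.
    units = np.asarray(unit_share, dtype=float) * total_units
    # Cells with no tax units have no defined average, so leave them as NaN.
    return np.divide(persons, units,
                     out=np.full_like(persons, np.nan), where=units > 0)
```

If only `person_share` is smoothed, any noise in `unit_share` passes straight through the division, which is consistent with the noisy result above.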

Given the PSID data issues described in #28, we tried returning to the taxdata CPS file in this notebook and used stratified LOESS. Here's the raw data for 18-64: [image]

To avoid the jumps, @prrathi is going to start with the counts excluding the household head, then add the household head to the appropriate count based on their age post hoc.

Here's the LOESS smoother with the 18-64 bin, just for household head ages 18-64 to avoid smoothing that spike: [image]

And the residuals: [image]

We tried some different values of frac (essentially bandwidth, defaults to 0.67), and found that 0.4 avoided large sustained residuals while also avoiding too many inflection points, which seem implausible.
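
A sketch of what the stratified smoother looks like, assuming the LOESS implementation is statsmodels' `lowess` (an assumption here, though its `frac` argument matches the default mentioned above). It fits a separate curve over age within each group; the function and argument names are hypothetical.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess


def stratified_lowess(age, group, value, frac=0.4):
    """Smooth value over age with a separate LOWESS fit for each group."""
    age, group, value = (np.asarray(a) for a in (age, group, value))
    smoothed = np.empty(len(value), dtype=float)
    for g in np.unique(group):
        mask = group == g
        # return_sorted=False returns fitted values in the input order.
        smoothed[mask] = lowess(value[mask], age[mask],
                                frac=frac, return_sorted=False)
    return smoothed
```

Residuals are then `value - stratified_lowess(age, group, value)`, which is what the residual plot above shows for a given frac.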

If the KDE smoothing for the numerator and denominator doesn't work as well, this stratified LOESS seems pretty good (though a multivariate LOESS would be better). @rickecon fyi.