PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Improve age variable on PUF #333

Open MattHJensen opened 4 years ago

MattHJensen commented 4 years ago

In a recent PSL call, we discussed improving the age variable on the PUF, which is currently brought over in the CPS match. This discussion followed a comment from @jdebacker about needing to use the CPS as a source of primary taxfiler data rather than the PUF in a recent report using OG-USA.

@MaxGhenis suggested the possibility of imputing age from the CPS rather than obtaining it during the match.

I recently found a snippet on TPC's approach in this report at pg 180:

TPC uses cross-tabulations by age, filing status, and income provided by SOI to impute the ages of taxpayers and dependents to the LAPUF. TPC then performs a constrained statistical match between the LAPUF and the 2012 CPS.

Another snippet is available in the TPC model FAQ:

We use cross-tabulations of age, filing status, and income sources we obtained from SOI to implement a raking algorithm to impute the ages of taxpayers and their dependents on to the LAPUF.

The closest published cross-tabulations I could find from SOI are in the Individual Complete Report (Publication 1304), Table 1.6, and the latest data are for 2017. But "provided by" and "obtained from" sound like TPC may be using non-public data from SOI.

MaxGhenis commented 3 years ago

I applied synthimpute to impute age in the CPS here. Rather than going to the PUF, it imputes on a holdout set of the CPS for evaluation (I didn't check that the x's are in the PUF yet).

The average imputed age is 0.33 years too high, and the standard deviation is 3.2 years too low. If comparison procedures like matching could predict quantiles, I think quantile loss would be the ideal evaluation metric; at that point, imputation is just a matter of drawing a uniform random quantile. Could the matching return the values for the nearest k records, and consider those the quantile range?
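To make the quantile loss idea concrete, here is a rough sketch of how it could be scored, along with one way a match might report a quantile from its k nearest records. The names `quantile_loss` and `knn_quantile`, the Euclidean distance, and the toy numbers in the usage note are all assumptions for illustration; none of this is taken from synthimpute or from the existing matching code.

```python
import numpy as np


def quantile_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for quantile q in (0, 1), averaged over records."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-predictions are weighted by q, over-predictions by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def knn_quantile(donor_x, donor_age, x, k, q):
    """Hypothetical helper: q-th quantile of age among the k nearest donors.

    Distance is plain Euclidean on the predictors, which is an assumption;
    a real match would likely use its own distance function.
    """
    donor_x = np.asarray(donor_x, dtype=float)
    donor_age = np.asarray(donor_age, dtype=float)
    dist = np.linalg.norm(donor_x - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    return float(np.quantile(donor_age[nearest], q))
```

For example, `knn_quantile([[0], [1], [2], [10]], [30, 40, 50, 80], [1], k=3, q=0.5)` takes the median age of the three closest donors, and `quantile_loss` can then score that prediction against the held-out true age at the same `q`.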

In general, my hunch is that matching will understate the conditional variance (may overfit too), and this will probably result in lower total variance too, but that'll only be part of the full picture. Random forests also understated variance in this experiment, so we'll have to compare, but I'd expect it to do better. We can also add variance manually based on performance on holdouts (Tetlock has recommended this for forecasting in general, though I can't find his quote atm).
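One simple way to add variance manually could look like the sketch below: measure the variance shortfall on a holdout set, then add independent noise with that much variance to the imputed ages. The function names are made up for illustration, and the Gaussian, independent noise is a strong simplifying assumption (it ignores age bounds and any relationship between error size and the predictors).

```python
import numpy as np


def noise_sd_from_holdout(holdout_true, holdout_imputed):
    """Standard deviation of the noise needed to close the variance gap.

    Assumes the added noise is independent of the imputed values, so that
    variances add. Returns 0 if the imputations are already variable enough.
    """
    gap = np.var(holdout_true) - np.var(holdout_imputed)
    return float(np.sqrt(max(gap, 0.0)))


def add_noise(imputed, noise_sd, seed=0):
    """Add zero-mean Gaussian noise (an assumed shape) to imputed ages."""
    rng = np.random.default_rng(seed)
    imputed = np.asarray(imputed, dtype=float)
    return imputed + rng.normal(0.0, noise_sd, size=imputed.shape)
```

Rounding or clipping the result to valid ages afterward would shrink the variance again slightly, so the holdout comparison would need to be rerun after any such step.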

Raking would be a good complement to this, and it's implemented in Python at https://github.com/Dirguis/ipfn. I'm not sure whether it's better to rake before imputing as TPC did, or vice versa, given we only have age ranges to rake on, but either way we'd need to impute also.
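For reference, two-way raking (iterative proportional fitting) is short enough to sketch directly in NumPy. This is a minimal illustration rather than the ipfn interface, which I have not reproduced here; the function name `rake` is made up, the row and column meanings (age range by filing status) and the targets in the usage note are placeholders rather than SOI figures, and the seed table is assumed to have strictly positive cells.

```python
import numpy as np


def rake(seed_table, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale a 2-D table until its margins match the targets.

    Rows might be age ranges and columns filing statuses (an assumption
    for illustration). Targets must share the same grand total, and the
    seed table is assumed to have strictly positive cells.
    """
    table = np.asarray(seed_table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row to hit its target, then each column.
        table *= (row_targets / table.sum(axis=1))[:, None]
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Columns match exactly after the last step; check the rows.
        if np.allclose(table.sum(axis=1), row_targets, rtol=0.0, atol=tol):
            break
    return table
```

Calling `rake([[1, 2], [3, 4]], [40, 60], [30, 70])` returns a table whose rows sum to roughly 40 and 60 and whose columns sum to 30 and 70, while preserving the odds ratios of the seed table. With only age ranges available as targets, the raked cells would still need a within-range imputation to get single years of age.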