PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Document statistical matching process #358

Open MaxGhenis opened 4 years ago

MaxGhenis commented 4 years ago

I need to understand the current statistical matching process to benchmark synthimpute's age imputation (#333). The current code has very few comments and lacks documentation, and I'm having trouble following it.

It seems like the gist is that it first buckets records from the CPS and the PUF by a few variables [1], and then within each bucket matches records by predicted taxable income [2]?

[1] Matches on cells of idept (dependent) x ijs (?) x iagede (senior?) x idepne (dependent exemptions?) x people x ikids (bucketed) x iself (constant value of 9?)

[2] Regression RHS (the predictors) is continuous versions of the variables in [1] plus some other income features
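
To make the two-step reading above concrete, here is a minimal sketch of cell bucketing followed by rank matching on a predicted value. It is not taxdata's actual code: the helper `match_within_cells`, the `id` and `yhat` fields, and the toy records are all hypothetical, the predicted taxable income is assumed to have been computed by a regression beforehand, and sampling weights are ignored entirely.

```python
from collections import defaultdict


def match_within_cells(puf, cps, cell_keys, score="yhat"):
    """Pair PUF and CPS records that share a cell, in order of predicted income.

    Hypothetical helper for illustration only. Records are plain dicts, and
    `score` names a field holding a predicted taxable income that is assumed
    to exist already. Sampling weights are ignored.
    """
    def bucket(records):
        # Step 1: group records by the tuple of their cell variables.
        cells = defaultdict(list)
        for rec in records:
            cells[tuple(rec[k] for k in cell_keys)].append(rec)
        return cells

    puf_cells, cps_cells = bucket(puf), bucket(cps)
    matches = []
    for cell, puf_recs in puf_cells.items():
        cps_recs = cps_cells.get(cell, [])
        if not cps_recs:
            continue  # no CPS record in this cell; real code would need a fallback
        # Step 2: sort both sides by predicted income and pair by rank.
        puf_sorted = sorted(puf_recs, key=lambda r: r[score])
        cps_sorted = sorted(cps_recs, key=lambda r: r[score])
        for p, c in zip(puf_sorted, cps_sorted):  # zip stops at the shorter side
            matches.append((p["id"], c["id"]))
    return matches


# Toy data: two cell variables borrowed from the list in [1], values made up.
puf = [
    {"id": "p1", "idept": 0, "ikids": 1, "yhat": 52000.0},
    {"id": "p2", "idept": 0, "ikids": 1, "yhat": 18000.0},
    {"id": "p3", "idept": 1, "ikids": 0, "yhat": 9000.0},
]
cps = [
    {"id": "c1", "idept": 0, "ikids": 1, "yhat": 20000.0},
    {"id": "c2", "idept": 0, "ikids": 1, "yhat": 49000.0},
    {"id": "c3", "idept": 1, "ikids": 0, "yhat": 11000.0},
]
matches = match_within_cells(puf, cps, cell_keys=("idept", "ikids"))
print(matches)
```

In this toy run the low-income PUF record pairs with the low-income CPS record in the same cell, and likewise for the high-income pair. Whether the real process handles unequal cell sizes or weights this way is exactly the kind of detail this issue asks to have documented.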

andersonfrailey commented 4 years ago

@MaxGhenis yep that's the gist of it. Here is a presentation I put together in 2018 that provides a general overview of statistical matching as well.

You're also right that the current code is hard to follow. Fixing that is on my to-do list for the refactor. And I think it's about time we made a big taxdata documentation push. Our docs are all over the place right now.