Open MaxGhenis opened 4 years ago
@MaxGhenis yep that's the gist of it. Here is a presentation I put together in 2018 that provides a general overview of statistical matching as well.
You're right that the current code is hard to follow as well. Fixing that is on my to-do list when refactoring. And I think it's about time we made a big taxdata documentation push. Our docs are all over the place right now.
I need to understand the current statistical matching process to benchmark `synthimpute`'s age imputation (#333). The current code has very few comments and lacks documentation, and I'm having trouble following it.

It seems like the gist is that it first buckets records from the CPS and the PUF by a few variables [1], and then within each bucket matches records by predicted taxable income [2]?
[1] Matches on cells of `idept` (dependent) x `ijs` (?) x `iagede` (senior?) x `idepne` (dependent exemptions?) x `people` x `ikids` (bucketed) x `iself` (constant value of 9?)

[2] Regression LHS is continuous versions of [1] and some other income features
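
To make the two-step reading above concrete, here is a minimal sketch of "bucket by cell, then match by predicted income within each bucket". It is not the taxdata implementation: the function name `match_within_cells`, the record fields `id` and `yhat` (a predicted taxable income assumed to come from a regression fitted beforehand), and the toy values are all hypothetical. Only two of the cell variables from [1] are used, and sample weights are ignored entirely.

```python
from collections import defaultdict


def match_within_cells(puf, cps, cell_vars, score="yhat"):
    """Pair PUF and CPS records that share a cell, in order of predicted income.

    Hypothetical helper for illustration only; it ignores sample weights.
    """
    def by_cell(records):
        # Group records by the tuple of their cell-variable values.
        cells = defaultdict(list)
        for rec in records:
            cells[tuple(rec[v] for v in cell_vars)].append(rec)
        return cells

    puf_cells, cps_cells = by_cell(puf), by_cell(cps)
    matches = []
    for key, puf_recs in puf_cells.items():
        cps_recs = cps_cells.get(key, [])
        if not cps_recs:
            continue  # no CPS record in this cell; a real procedure needs a fallback
        puf_sorted = sorted(puf_recs, key=lambda r: r[score])
        cps_sorted = sorted(cps_recs, key=lambda r: r[score])
        # Pair by relative rank so cells with unequal counts still match every PUF record.
        for i, p in enumerate(puf_sorted):
            j = min(i * len(cps_sorted) // len(puf_sorted), len(cps_sorted) - 1)
            matches.append((p["id"], cps_sorted[j]["id"]))
    return matches


# Toy records (made-up values): two cells defined by idept x ijs.
puf = [
    {"id": "p1", "idept": 0, "ijs": 1, "yhat": 10_000},
    {"id": "p2", "idept": 0, "ijs": 1, "yhat": 50_000},
    {"id": "p3", "idept": 1, "ijs": 1, "yhat": 5_000},
]
cps = [
    {"id": "c1", "idept": 0, "ijs": 1, "yhat": 12_000},
    {"id": "c2", "idept": 0, "ijs": 1, "yhat": 48_000},
    {"id": "c3", "idept": 1, "ijs": 1, "yhat": 4_000},
]

matches = match_within_cells(puf, cps, cell_vars=["idept", "ijs"])
print(matches)
```

Pairing by rank is just the simplest stand-in for "match on predicted income"; if the actual code matches on weights or splits records, this sketch would not capture that.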