PSLmodels / PUF-State-Distribution


Fleshing out a maximum-entropy constrained NLP approach fully enough to allow implementation #3

Open donboyd5 opened 6 years ago

donboyd5 commented 6 years ago

Summary of previous relevant discussion at #2.

In #2 we discussed assigning PUF records to individual states, based either on the estimated probability that a PUF record actually comes from a particular state, or on a measure of the record's (Euclidean) distance from summary characteristics of each state's records, which is a closely related concept.

Probabilities might be estimated from other, similar microdata that have state codes, using a multinomial logit approach. At the moment we probably do not have data sufficient for this.
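For concreteness, here is a minimal sketch of what that step might look like if we had suitable state-coded microdata. The feature names, the `state` column, and the use of statsmodels are my assumptions for illustration, not something we have settled on.

```python
# Hypothetical sketch: estimate state probabilities with a multinomial logit,
# assuming we had microdata with state codes (we currently do not).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fit_state_probability_model(train_df, feature_cols, state_col="state"):
    """Fit a multinomial logit of state on record characteristics.

    train_df:     hypothetical state-coded microdata (not the PUF).
    feature_cols: placeholder names, e.g. ["wages", "interest", "agi"].
    """
    X = sm.add_constant(train_df[feature_cols])
    y = pd.Categorical(train_df[state_col])
    result = sm.MNLogit(y.codes, X).fit(disp=False)
    return result, list(y.categories)

def predict_state_probs(result, states, puf_df, feature_cols):
    """Return an (n_records x n_states) matrix of estimated probabilities."""
    X_puf = sm.add_constant(puf_df[feature_cols])
    probs = result.predict(X_puf)          # each row sums to 1
    return pd.DataFrame(np.asarray(probs), columns=states, index=puf_df.index)
```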

Distances might be estimated by comparing records to summary state-level data such as SOI Historic Table 2 (https://www.irs.gov/statistics/soi-tax-stats-historic-table-2); this loses the richness of microdata but has the advantage of feasibility (we already have the necessary data).
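A corresponding sketch of the distance idea: compare standardized record characteristics to per-return state profiles built from Historic Table 2. Again, the column names and the standardization choice are placeholders, not a settled specification.

```python
# Sketch of the distance-based alternative: Euclidean distance from each PUF
# record to each state's summary profile (e.g., per-return averages derived
# from SOI Historic Table 2).
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def state_distances(puf_df, state_profiles, feature_cols):
    """Return an (n_records x n_states) matrix of Euclidean distances.

    puf_df:         PUF records, one row per record.
    state_profiles: one row per state, same feature columns, expressed on a
                    per-return basis so records and states are comparable.
    """
    # Standardize features so no single dollar-scaled variable dominates.
    mu = puf_df[feature_cols].mean()
    sd = puf_df[feature_cols].std().replace(0, 1.0)
    X = ((puf_df[feature_cols] - mu) / sd).to_numpy()
    S = ((state_profiles[feature_cols] - mu) / sd).to_numpy()
    dist = cdist(X, S)                      # pairwise Euclidean distances
    return pd.DataFrame(dist, index=puf_df.index, columns=state_profiles.index)
```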

We discussed two further, related, issues:

  1. If we assign each record only to its highest-probability (or closest-distance) state, then each state will look like its average taxpayer and we will lose important variation that exists in the real world, because low-probability records are never included. This is undesirable. We discussed two ways to avoid it (sketched in code after this list): (a) Assign records to states randomly, in a manner that makes assignment to high-probability states likely but still allows assignment to low-probability states (this is what the Stata code Dan provided does; its mechanism was assigning states to records rather than records to states, but that is the same thing). This assignment can be repeated multiple times. Or, (b) Distribute portions of each record to states based upon probabilities (or distances), so that each record can be assigned to multiple states, with larger portions likely to go to the high-probability (or low-distance) states. This allows portions of low-probability records to be distributed to states, so that we retain variation.

  2. Let's assume we have addressed the first point, either by multiple assignments of records to states or by distribution of portions of records to states. We now have a file that is, in some sense, representative of the 50 states. It would have more records than the initial, say, 150k: if we used the assignment approach 10 times, it would have 1.5 million records; if we used the distribution approach and included all 50 states in each record's distribution, it would have 7.5 million records. The records generally would be consistent with the characteristics of states, with variation. But there is no reason to believe this file would hit the targets we have for the 50 states from the SOI summary data, although it should have moved in that direction.
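As referenced in point 1, here is a rough sketch of the two mechanisms. It assumes we already have a record-by-state probability matrix `probs` (from the logit sketch above or, as a further assumption, from distances converted via something like a softmax of negative distance), and that the PUF carries a `weight` column; both are illustrative assumptions.

```python
# Sketch of the two approaches in point 1, given a probability matrix `probs`
# (rows sum to 1, columns are states) aligned to puf_df's index.
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)

def assign_states(puf_df, probs, n_draws=10):
    """(a) Repeated random assignment: each draw assigns one state per record,
    drawn with that record's probabilities; stacking n_draws draws gives
    n_draws * len(puf_df) records, with weights scaled by 1/n_draws."""
    states = probs.columns.to_numpy()
    cum = probs.to_numpy().cumsum(axis=1)          # row-wise CDF
    pieces = []
    for _ in range(n_draws):
        u = rng.random(len(puf_df))[:, None]
        picked = states[(u > cum).sum(axis=1)]     # inverse-CDF draw per record
        piece = puf_df.copy()
        piece["state"] = picked
        piece["weight"] = piece["weight"] / n_draws
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)

def distribute_records(puf_df, probs):
    """(b) Fractional distribution: every record is split across all states,
    with its weight apportioned by probability (50x as many records)."""
    long = probs.stack().rename("share").reset_index()
    long.columns = ["record_id", "state", "share"]
    out = long.merge(puf_df, left_on="record_id", right_index=True)
    out["weight"] = out["weight"] * out["share"]
    return out
```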

For people who want a file that hits known/estimated totals (me), this is a problem to be solved. We talked about adjusting record weights from this point using a constrained NLP (nonlinear programming) approach to ensure that the targets are hit. Dan proposed a maximum-entropy objective function.
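To make the structure concrete, here is a minimal sketch of how that reweighting might be set up, assuming the maximum-entropy objective is operationalized as minimum relative entropy (KL divergence) of the new weights from the initial weights, with linear state targets as equality constraints. That formulation, and the use of scipy, are my assumptions; at realistic problem sizes we would presumably want a dedicated NLP solver such as IPOPT.

```python
# Minimal sketch of the weight-adjustment step for one state: keep new weights
# w as close as possible (in KL terms) to the initial weights w0, subject to
# hitting linear targets A @ w = b.  scipy is used only to illustrate the
# structure; a large problem would need a specialized NLP solver.
import numpy as np
from scipy.optimize import minimize, LinearConstraint, Bounds

def reweight_max_entropy(A, b, w0):
    """A:  (n_targets x n_records) per-record contributions to each target
           (e.g., AGI, wages, number of returns).
    b:  (n_targets,) vector of SOI-based state targets.
    w0: (n_records,) strictly positive initial weights from the
        assignment/distribution step."""
    def objective(w):
        # relative entropy (KL divergence) of w with respect to w0
        return np.sum(w * np.log(w / w0))

    def gradient(w):
        return np.log(w / w0) + 1.0

    res = minimize(
        objective, x0=w0, jac=gradient, method="trust-constr",
        constraints=[LinearConstraint(A, b, b)],   # A @ w == b
        bounds=Bounds(1e-8, np.inf),               # keep weights positive
    )
    return res.x
```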

Is this an accurate summary? I'll propose some next steps but would love to see feedback first.