PSLmodels / PUF-State-Distribution


Assigning a state to a record, randomly, based on probability of the record being from particular states (versus distributing portions of records to states) #2

Open donboyd5 opened 6 years ago

donboyd5 commented 6 years ago

I am moving Dan's comments on this point to here. Here is his initial comment (https://github.com/open-source-economics/taxdata/issues/138#issuecomment-353685059).

It is possible to assign a single state to each record in an unbiased manner. The way I have done this is to calculate a probability of a record being in each of the 50 states, and assign it to one of those states in proportion to those probabilities. That is, if a record has high state income tax the procedure will show high probabilities for New York, California, etc., and low probabilities (but not zero) for Florida and Texas. Then the computer will select New York or California with high probability and Florida or Texas with low probability. In expectation the resulting totals will be the same as the "long" format but with some unbiased error. I have done this and find that state level aggregates match nearly as well as summing over all possible states. If desired, one could take 2 draws, or any other number. It would not be necessary to multiply the workload by 51.

donboyd5 commented 6 years ago

Ernie asked for code that does this. Here is the code Dan provided (https://github.com/open-source-economics/taxdata/issues/138#issuecomment-353685059).

I only have Stata code. The variables below (cumulative, onestate, etc.) are vectors with an element for each record. p1 through p51 are the estimated probabilities of a record being in each state. r is a random value uniform on 0-1.

    * For each record, draw one state: onestate ends up as the first state
    * whose cumulative probability reaches the uniform draw r
    * (the cum1-cum51 variables are assumed to already exist).
    gen cumulative = 0
    gen byte onestate = 1
    quietly {
        forvalues state = 1/51 {
            replace cumulative = cumulative + p`state'
            replace cum`state' = cumulative
            replace onestate = `state' + 1 if cum`state' < r
        }
    }

    * Build, for each variable in `vars', its contribution to each state:
    * the variable multiplied by that state's probability.
    quietly {
        forvalues state = 1/52 {
            foreach v in `vars' {
                generate p`v'`state' = `v' * p`state'
            }
        }
    }

It should be straightforward to do this in R or any language.

dan
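(For reference, a rough NumPy version of the same draw. The variable names and the random placeholder probabilities are illustrative, not taken from Dan's code.)

    import numpy as np

    rng = np.random.default_rng(0)          # any seed; illustrative only
    n_records, n_states = 40000, 51

    # probs: row i holds the estimated probabilities p1..p51 for record i
    # (placeholder random values here; each row sums to 1).
    probs = rng.dirichlet(np.ones(n_states), size=n_records)

    # One uniform draw per record; the chosen state is the first one whose
    # cumulative probability reaches the draw, as in the Stata loop above.
    cum = np.cumsum(probs, axis=1)
    r = rng.random(n_records)
    onestate = 1 + (cum < r[:, None]).sum(axis=1)   # 1-based, like Stata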

donboyd5 commented 6 years ago

Dan,

Thanks. A few questions:

feenberg commented 6 years ago

On Sat, 23 Dec 2017, Don Boyd wrote:

Dan,

Thanks. A few questions:

  • I presume the probabilities come from a multinomial logit, similar to some of our prior discussions?

Yes. I ran a logit on the <200K taxpayers and then used that to impute a state of residence. I then tested the state aggregates for a few variables such as tax paid on the <200K and >200K taxpayers separately. It did well for the estimation sample, and poorly for the >200K sample. So I do not propose to use the coefficients from 2008 on the 2011 PUF. That is why I am working on the MaxEnt procedure, and also why I have applied to SOI to run the logit regressions on recent confidential data. I have heard back from Barry that he intends to respond favorably to that application, but hasn't done anything definite.

dan


donboyd5 commented 6 years ago

Yes, unfortunately, the fact that we don't have state codes on the newer files means that we can't estimate probabilities that way. What do you think of approaches that try to construct a measure of how "close" a record is to a particular state?

Let's take the $50-75k income range as an example:

  1. Using the SOI summary data, get measures of what average returns in this range look like in each state. Get % of returns that have wages, average wages of those who have; % that have capgains, average cap gains of those who have; % that have interest income, average interest income of those who have, % who had SALT deduction, average SALT deduction of those who have, and so on.

  2. For each of the (approx) 40k PUF records in this range, compute Euclidean distances from each of the 50 states on the 20 or so attributes defined in step 1. (If the record has no SALT deduction, then the SALT-proportion component of its distance will be farther from NY, where let's say the proportion is 30%, than from MS, where let's say the proportion is 10%. If the record has cap gains and the amount is great, then the 2 cap gains components of the distance will be closer to NY, where the cap gains % is, let's say, 20% and the cap gains average is high, than to MS.) We end up with 50 distances for each of 40k records. A rough sketch of this computation appears below.
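A minimal sketch of that distance computation, with placeholder data (the array names, the random placeholder values, and the standardization step are assumptions, not part of the proposal):

    import numpy as np

    n_records, n_states, n_attrs = 40000, 50, 20

    # state_profiles[s, k]: attribute k for state s (e.g., % with SALT,
    # average SALT) from the SOI summary data for the $50-75k range.
    # record_attrs[i, k]: the same attributes computed for PUF record i.
    rng = np.random.default_rng(1)
    state_profiles = rng.random((n_states, n_attrs))   # placeholder data
    record_attrs = rng.random((n_records, n_attrs))    # placeholder data

    # Standardize each attribute so percentage and dollar columns are on
    # comparable scales (some such scaling choice is needed before mixing units).
    mu = state_profiles.mean(axis=0)
    sd = state_profiles.std(axis=0)
    z_states = (state_profiles - mu) / sd
    z_records = (record_attrs - mu) / sd

    # Euclidean distance of every record to every state: (n_records, n_states).
    dist = np.linalg.norm(z_records[:, None, :] - z_states[None, :, :], axis=2)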

This has information that, I would argue, is meaningful - information that max entropy would not have.

The question, then, is what to do with this information?

One approach would be to use it in a first stage, similar to the probability approach Dan outlined above, to assign a state to each record (perhaps multiple times, as he notes). We would (or at least I would) still need a 2nd stage, to adjust weights to hit targets - an NLP approach.

Another approach is to use the distance measures as a component of the objective function (the penalty function to be minimized) so that distributing weight to a low-distance state is penalized less than distributing weight to a high-distance state.
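As one purely illustrative possibility (not something settled in this thread), the distance matrix could be turned into first-stage probabilities with a softmax of negative distances, so that smaller distances get larger probabilities:

    import numpy as np

    def distances_to_probs(dist, tau=1.0):
        """Map an (n_records, n_states) distance matrix to per-record state
        probabilities: smaller distance -> larger probability. tau is an
        arbitrary tuning parameter (an assumption, not from this thread);
        each row of the result sums to 1."""
        scores = -dist / tau
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        w = np.exp(scores)
        return w / w.sum(axis=1, keepdims=True)

The resulting probabilities could feed the same cumulative-draw assignment shown earlier, or the raw distances could instead enter the second approach directly as per-state penalty weights in the objective function.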

Thoughts?

feenberg commented 6 years ago

On Sat, 23 Dec 2017, Don Boyd wrote:

Yes, unfortunately, the fact that we don't have state codes on the newer files means that we can't estimate probabilities that way. What do you think of approaches that try to construct a measure of how "close" a

I assume by "that way" you mean logit.

record is to a particular state?

Let's take the $50-75k income range as an example:

1. Using the SOI summary data, get measures of what average returns in this range look like in each state. Get % of returns that have wages, average wages of those who have; % that have capgains, average cap gains of those who have; % that have interest income, average interest income of those who have, % who had SALT deduction, average SALT deduction of those who have, and so on.

You want to use the count and amount for each of a dozen or more state by income class aggregates. That will be the basis of any imputation with no help from SOI.

2. For each of the (approx) 40k PUF records in this range, compute Euclidean distances from each of the 50 states on the 20 or so attributes defined in step 1. (If the record has no SALT deduction, then the SALT-proportion component of its distance will be farther from NY, where let's say the proportion is 30% than from MS, where let's say the proportion is 10%. If the record has cap gains and the amount is great, then the 2 capgains components of the distance will be closer to NY, where the cap gains % is, let's say, 20% and the capgains average is high, than it is to MS.) We end up with 50 distances for each of 40k records.

This has information that, I would argue, is meaningful - information that max entropy would not have.

MaxEnt has all of that information if it is in the constraints. We want to maximize the entropy of the probability assignments constrained by the necessity of matching the published aggregates for multiple variables. So the information is used.
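(In rough symbols, and with notation that is illustrative rather than taken from this thread, the problem described here is something like:

    \max_{p \ge 0} \; -\sum_{i}\sum_{s} w_i \, p_{is} \log p_{is}
    \quad \text{s.t.} \quad \sum_{s} p_{is} = 1 \;\; \forall i,
    \qquad \sum_{i} w_i \, p_{is} \, x_{ik} = A_{sk} \;\; \forall s, k

where p_{is} is the share of record i's national weight w_i assigned to state s, x_{ik} is targeted variable k on record i, and A_{sk} is the published aggregate for variable k in state s.)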

The question, then, is what to do with this information?

One approach would be to use it in a first stage, similar to the probability approach Dan outlined above, to assign state to records (perhaps multiple times). We would (or at least I would) still need a 2nd stage, to adjust weights to hit targets - an NLP approach.

You don't say how you would translate Euclidean distances into probabilities. Any way you do it, if you assign taxpayers to the state most like themselves, the variation within a state will be attenuated. Each state will have taxpayers that look like the average for that state. The MaxEnt approach ensures that each state has the amount of variation that the data implies. Since we don't have specific information on the within-state variation, the within-state variation is that which results when no additional information beyond the taxpayer data and state-level aggregates is imposed on the result. Using the minimum Euclidean distance imposes a strong requirement on the result.

Another approach is to use the distance measures as a component of the objective function (the penalty function to be minimized) so that distributing weight to a low-distance state is penalized less than distributing weight to a high-distance state.

MaxEnt makes the best estimate of the proportion of good to bad taxpayer to state matches that can be made without additional information.

dan

Thoughts?


donboyd5 commented 6 years ago

I've created a new issue to take up the question of how to implement a maximum-entropy constrained NLP approach, as it is really not about assigning a state randomly to a record (this issue), and deserves a full discussion: https://github.com/open-source-economics/PUF-State-Distribution/issues/3.

Before that, a few quick comments back:

You want to use the count and amount for each of a dozen or more state by income class aggregates. That will be the basis of any imputation with no help from SOI.

I want to start by creating targets (constraints) based upon counts and amounts for aggregates from the publicly available SOI summaries at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2. That provides a rich set of targets. It's not that I don't want help from SOI. If they (the people at SOI) can provide better information for targeting, or if you, through your work with them, can develop better information for targeting, then that is a big plus. I am a believer in iterative refinement: We should start by doing the best we can with what we have now. If we can get better information from the people at SOI, then one of the iterative improvements would be to incorporate that information when available.

You don't say how you would translate Euclidean distances into probabilities. Any way you do it, if you assign taxpayers to the state most like themselves, the variation within a state will be attenuated.

Yes, that's true of any approach that assigns records to states based upon probability, rather than distributing portions of records to states (the latter allowing portions of low-probability records to be distributed to a state, the former not), isn't it, including the approach you outlined? My intent, however we define probabilities or distances, would be to distribute portions of records to states, rather than to uniquely assign records to a state, to avoid this problem.

feenberg commented 6 years ago

On Sun, 24 Dec 2017, Don Boyd wrote:

I've created a new issue to take up the question of how to implement a maximum-entropy constrained NLP approach, as it is really not about assigning a state randomly to a record (this issue), and deserves a full discussion: #3.

Before that, a few quick comments back:

  You want to use the count and amount for each of a dozen or more state by income class aggregates. That will be the basis of any imputation with no help from SOI.

I want to start by creating targets (constraints) based upon counts and amounts for aggregates from the publicly available SOI summaries at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2. That provides a rich set of targets. It's not that I don't want help from SOI. If they (the people at SOI) can provide better information for targeting, or if you, through your work with them, can develop better information for targeting, then that is a big plus. I am a believer in

I have no confidence that Barry will come through on his promise, so I would never suggest we wait on him. Eventually we might have his cooperation.

iterative refinement: We should start by doing the best we can with what we have now. If we can get better information from the people at SOI, then one of the iterative improvements would be to incorporate that information when available.

  You don't say how you would translate Euclidean distances into probabilities. Any way you do it, if you assign taxpayers to the state most like themselves, the variation within a state will be attenuated.

Yes, that's true of any approach that assigns records to states based upon probability, rather than distributing portions of records to states (the latter allowing portions of low-probability records to be distributed to a state, the former not), isn't it, including the approach you outlined? My intent, however we define probabilities or distances, would be to distribute portions of records to states, rather than to uniquely assign records to a state, to avoid this problem.

I guess my response is that I expect the best way to distribute the weights will be given by MaxEnt, unless some additional information is available.

dan