donboyd5 opened 6 years ago
Ernie asked for code that does this. Here is the code Dan provided (https://github.com/open-source-economics/taxdata/issues/138#issuecomment-353685059).
I only have Stata code. The variables below (cumulative, onestate, etc.) are vectors with an element for each record. p1 through p51 are the estimated probabilities of a record being in each state. r is a random value, uniform on 0-1.
```stata
gen cumulative = 0
gen byte onestate = 1
quietly {
    forvalues state = 1/51 {
        // accumulate the probabilities; the first state whose cumulative
        // probability reaches r is the one assigned to the record
        replace cumulative = cumulative + p`state'
        replace cum`state' = cumulative
        replace onestate = `state' + 1 if cum`state' < r
    }
}
quietly {
    // spread each variable across the states in proportion to the probabilities
    forvalues state = 1/52 {
        foreach v in `vars' {
            generate p`v'`state' = `v'*p`state'
        }
    }
}
```
It should be straightforward to do this in R or any language.
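For illustration, a rough NumPy sketch of the same two steps (fake data and array names of my choosing, not the actual Stata variables):

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_states = 1000, 51

# illustrative probabilities, one row per record, each row summing to 1
# (in practice these would come from the estimated model)
p = rng.dirichlet(np.ones(n_states), size=n_records)

# step 1: one uniform draw per record; assign the first state whose
# cumulative probability reaches the draw (same logic as the Stata loop)
r = rng.uniform(size=(n_records, 1))
cumulative = np.cumsum(p, axis=1)
onestate = (cumulative < r).sum(axis=1)        # 0-based state index

# step 2: spread a variable across states in proportion to the probabilities
wages = rng.lognormal(10, 1, size=n_records)   # illustrative variable
wages_by_state = wages[:, None] * p            # shape (n_records, n_states)
```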
dan
Dan,
Thanks. A few questions:
I presume the probabilities come from a multinomial logit, similar to some of our prior discussions? Or did you generate them some other way?
The cumulative method would seem to favor whatever states were ordered earliest (e.g., alpha order), since you make the state assignment (in essence) as soon as you've accumulated probabilities for a given record that sum to at least the value of r, the uniform random variable. Is my interpretation correct? If so, does that concern you? If so, would it make sense, for each record, to order the states by descending probability? (The ordering of states would differ from record to record.)
I don't understand what (or why) you're doing in your second step. It looks like you're taking each variable (such as wages?) and creating 52 new variables, one for each state, with the amount of the variable given to the state based upon that state's probability of being chosen. Am I interpreting this correctly? I don't follow why you do this. Or are the "vars" different weight variables, and you are distributing them to states according to the probabilities, in which case this would be an alternative to the single-state assignment?
Let's say you did the single-state assignment based upon probabilities. Let's say you did it 10 times, each with different r values, and/or doing it without replacement so that a state assigned to a record in an earlier draw could not be assigned to that same record in a later draw. This would yield a file with 1.5m records (if we started with 150k records), with the most populous states generally having the most records (right?). (Perhaps we did it 10 times because we wanted to ensure that the smaller states had "enough" records, somehow defined.)
If we did that, and we summarized the file (after scaling weights appropriately), we would find that our sums for each state and income range do not hit our desired targets, although they might not be terribly far off because we based them on reasonably estimated probabilities. Those of us who want a file that hits those targets (certainly me) would want a second stage, where we adjusted weights to hit the targets. So I think of the approach you have defined above as a first stage, similar in objective to my first stage (a simple scaling of weights), but instead using probabilities. But we still need a second stage. Presumably we like the weights we have obtained in the first stage so we would adjust the weights in a way that minimizes a penalty function based on how far the weights move from the first-stage weights. Do you agree?
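To make that concrete, a toy sketch of such a second stage, under assumptions of my own (a simple least-squares penalty on weight movement, fabricated targets, and scipy's SLSQP solver); the real penalty function and solver are open questions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200                                          # toy number of records in one state

w0 = rng.uniform(50, 150, n)                     # first-stage weights
x = np.column_stack([np.ones(n),                 # return count
                     rng.lognormal(10, 1, n)])   # e.g., wages
targets = (x.T @ w0) * np.array([1.05, 0.95])    # pretend targets we must hit

# penalize movement away from the first-stage weights ...
penalty = lambda w: np.sum((w - w0) ** 2 / w0)
# ... subject to hitting the targets (scaled so the constraints are comparable)
cons = {"type": "eq", "fun": lambda w: (x.T @ w) / targets - 1.0}

res = minimize(penalty, w0, method="SLSQP",
               constraints=[cons], bounds=[(0, None)] * n)
w1 = res.x
print((x.T @ w1) / targets - 1.0)                # relative misses, roughly zero
```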
On Sat, 23 Dec 2017, Don Boyd wrote:
Dan,
Thanks. A few questions:
- I presume the probabilities come from a multinomial logit, similar to some of our prior discussions?
Yes. I ran a logit on the <200K taxpayers and then used that to impute a state of residence. I then tested the state aggregates for a few variables such as tax paid on the <200K and >200K taxpayers separately. It did well for the estimation sample, and poorly for the >200K sample. So I do not propose to use the coefficients from 2008 on the 2011 PUF. That is why I am working on the MaxEnt procedure, and also why I have applied to SOI to run the logit regressions on recent confidential data. I have heard back from Barry that he intends to respond favorably to that application, but he hasn't done anything definite.
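(For context only: a generic sketch of how probabilities of this kind can be produced from a file that does have state codes. The data are fake and scikit-learn is an arbitrary choice; this is not the actual estimation described above.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, k, n_states = 5000, 12, 51

# fake training data standing in for returns with known state codes
X_train = rng.normal(size=(n, k))            # return characteristics (wages, SALT, ...)
y_train = rng.integers(0, n_states, size=n)  # observed state of residence

# with the default lbfgs solver this fits a multinomial logit
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# records without a state code get a full vector of state probabilities,
# which play the role of p1..p51 in the Stata code above
X_new = rng.normal(size=(200, k))
p = logit.predict_proba(X_new)               # shape (200, 51); rows sum to 1
```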
dan
Yes, unfortunately, the fact that we don't have state codes on the newer files means that we can't estimate probabilities that way. What do you think of approaches that try to construct a measure of how "close" a record is to a particular state?
Let's take the $50-75k income range as an example:
1. Using the SOI summary data, get measures of what average returns in this range look like in each state. Get % of returns that have wages, average wages of those who have; % that have capgains, average cap gains of those who have; % that have interest income, average interest income of those who have; % who had SALT deduction, average SALT deduction of those who have; and so on.
2. For each of the (approx) 40k PUF records in this range, compute Euclidean distances from each of the 50 states on the 20 or so attributes defined in step 1. (If the record has no SALT deduction, then the SALT-proportion component of its distance will be farther from NY, where let's say the proportion is 30%, than from MS, where let's say the proportion is 10%. If the record has cap gains and the amount is large, then the 2 capgains components of the distance will be closer to NY, where the cap gains % is, let's say, 20% and the capgains average is high, than it is to MS.) We end up with 50 distances for each of 40k records (a rough code sketch follows below).
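A code sketch of step 2, with the array names and the standardization step being my assumptions (the SOI attributes mix percentages and dollar averages, so some rescaling is needed before distances mean much):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)

# hypothetical inputs for one income range:
#   state_profile: 50 states x ~20 attributes built from the SOI summary data
#   rec_profile:   ~40k PUF records x the same attributes
state_profile = rng.normal(size=(50, 20))
rec_profile = rng.normal(size=(40_000, 20))

# put the attributes on a common scale so no single one dominates the distance
mu, sd = state_profile.mean(axis=0), state_profile.std(axis=0)
sp = (state_profile - mu) / sd
rp = (rec_profile - mu) / sd

# Euclidean distance from every record to every state: shape (40000, 50)
dist = cdist(rp, sp)
```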
This has information that, I would argue, is meaningful - information that max entropy would not have.
The question, then, is what to do with this information?
One approach would be to use it in a first stage, similar to the probability approach Dan outlined above, to assign a state to each record (perhaps multiple times, as he notes). We would (or at least I would) still need a 2nd stage, to adjust weights to hit targets - an NLP approach.
Another approach is to use the distance measures as a component of the objective function (the penalty function to be minimized) so that distributing weight to a low-distance state is penalized less than distributing weight to a high-distance state.
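The second approach might look something like this toy sketch, where everything (tiny sizes, fabricated targets, the SLSQP solver) is an assumption meant only to show where the distances would enter the objective:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n_rec, n_state = 30, 4                           # toy sizes

w = rng.uniform(50, 150, n_rec)                  # national record weights
x = rng.lognormal(10, 1, n_rec)                  # one targeted variable (say, wages)
dist = rng.uniform(0.1, 2.0, (n_rec, n_state))   # record-to-state distances from step 2

# fabricate feasible state targets from an arbitrary allocation so the toy
# problem has a solution; real targets would come from the SOI summaries
s0 = rng.dirichlet(np.ones(n_state), size=n_rec)
targets = (w[:, None] * s0 * x[:, None]).sum(axis=0)

def penalty(v):
    s = v.reshape(n_rec, n_state)                # s[i, j]: share of record i's weight in state j
    return np.sum(w[:, None] * s * dist)         # weight sent to distant states costs more

cons = [
    # each record's shares across states sum to 1
    {"type": "eq", "fun": lambda v: v.reshape(n_rec, n_state).sum(axis=1) - 1.0},
    # state totals of weighted wages hit the targets (scaled for comparability)
    {"type": "eq", "fun": lambda v: (w[:, None] * v.reshape(n_rec, n_state)
                                     * x[:, None]).sum(axis=0) / targets - 1.0},
]
res = minimize(penalty, s0.ravel(), method="SLSQP",
               constraints=cons, bounds=[(0, 1)] * (n_rec * n_state))
shares = res.x.reshape(n_rec, n_state)           # distance-aware weight shares
```

In practice there would be 50+ states, many targeted variables, and presumably some additional term in the objective to keep each record from piling its entire weight onto its single nearest state.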
Thoughts?
On Sat, 23 Dec 2017, Don Boyd wrote:
Yes, unfortunately, the fact that we don't have state codes on the newer files means that we can't estimate probabilities that way. What do you think of approaches that try to construct a measure of how "close" a
I assume by "that way" you mean logit.
record is to a particular state?
Let's take the $50-75k income range as an example:
1. Using the SOI summary data, get measures of what average returns in this range look like in each state. Get % of returns that have wages, average wages of those who have; % that have capgains, average cap gains of those who have; % that have interest income, average interest income of those who have; % who had SALT deduction, average SALT deduction of those who have; and so on.
You want to use the count and amount for each of a dozen or more state by income class aggregates. That will be the basis of any imputation with no help from SOI.
2. For each of the (approx) 40k PUF records in this range, compute Euclidean distances from each of the 50 states on the 20 or so attributes defined in step 1. (If the record has no SALT deduction, then the SALT-proportion component of its distance will be farther from NY, where let's say the proportion is 30%, than from MS, where let's say the proportion is 10%. If the record has cap gains and the amount is large, then the 2 capgains components of the distance will be closer to NY, where the cap gains % is, let's say, 20% and the capgains average is high, than it is to MS.) We end up with 50 distances for each of 40k records.
This has information that, I would argue, is meaningful - information that max entropy would not have.
MaxEnt has all of that information if it is in the constraints. We want to maximize the entropy of the probability assignments constrained by the necessity of matching the published aggregates for multiple variables. So the information is used.
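In symbols (my notation, as I read the description above):

$$
\max_{p \ge 0}\; -\sum_i \sum_s p_{is}\,\log p_{is}
\qquad\text{subject to}\qquad
\sum_s p_{is} = 1 \ \ \forall i,
\qquad
\sum_i w_i\, p_{is}\, x_{iv} = T_{sv} \ \ \forall s, v,
$$

where $p_{is}$ is the probability that record $i$ belongs to state $s$, $w_i$ is the record's national weight, $x_{iv}$ is targeted variable $v$ on record $i$, and $T_{sv}$ is the published aggregate for variable $v$ in state $s$.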
The question, then, is what to do with this information?
One approach would be to use it in a first stage, similar to the probability approach Dan outlined above, to assign state to records (perhaps multiple times). We would (or at least I would) still need a 2nd stage, to adjust weights to hit targets - an NLP approach.
You don't say how you would translate Euclidean distances into probabilities. Any way you do it, if you assign taxpayers to the state most like themselves, the variation within a state will be attenuated. Each state will have taxpayers that look like the average for that state. The MaxEnt approach ensures that each state has the amount of variation that the data implies. Since we don't have specific information on the within-state variation, the within-state variation is that which results when no additional information beyond the taxpayer data and state-level aggregates is imposed on the result. Using the minimum Euclidean distance imposes a strong requirement on the result.
Another approach is to use the distance measures as a component of the objective function (the penalty function to be minimized) so that distributing weight to a low-distance state is penalized less than distributing weight to a high-distance state.
MaxEnt makes the best estimate of the proportion of good to bad taxpayer to state matches that can be made without additional information.
dan
Thoughts?
I've created a new issue to take up the question of how to implement a maximum-entropy constrained NLP approach, as it is really not about assigning a state randomly to a record (this issue), and deserves a full discussion: https://github.com/open-source-economics/PUF-State-Distribution/issues/3.
Before that, a few quick comments back:
You want to use the count and amount for each of a dozen or more state by income class aggregates. That will be the basis of any imputation with no help from SOI.
I want to start by creating targets (constraints) based upon counts and amounts for aggregates from the publicly available SOI summaries at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2. That provides a rich set of targets. It's not that I don't want help from SOI. If they (the people at SOI) can provide better information for targeting, or if you, through your work with them, can develop better information for targeting, then that is a big plus. I am a believer in iterative refinement: We should start by doing the best we can with what we have now. If we can get better information from the people at SOI, then one of the iterative improvements would be to incorporate that information when available.
You don't say how you would translate Euclidean distances into probabilities. Any way you do it, if you assign taxpayers to the state most like themselves, the variation within a state will be attenuated.
Yes, that's true of any approach that assigns records to states based upon probability, rather than distributing portions of records to states (the latter allowing portions of low-probability records to be distributed to a state, the former not), isn't it, including the approach you outlined? My intent, however we define probabilities or distances, would be to distribute portions of records to states, rather than to uniquely assign records to a state, to avoid this problem.
On Sun, 24 Dec 2017, Don Boyd wrote:
I've created a new issue to take up the question of how to implement a maximum-entropy constrained NLP approach, as it is really not about assigning a state randomly to a record (this issue), and deserves a full discussion: #3.
Before that, a few quick comments back:
You want to use the count and amount for each of a dozen or more state by income class aggregates. That will be the basis of any imputation with no help from SOI.
I want to start by creating targets (constraints) based upon counts and amounts for aggregates from the publicly available SOI summaries at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2. That provides a rich set of targets. It's not that I don't want help from SOI. If they (the people at SOI) can provide better information for targeting, or if you, through your work with them, can develop better information for targeting, then that is a big plus. I am a believer in
I have no confidence that Barry will come through on his promise, so I would never suggest we wait on him. Eventually we might have his cooperation.
iterative refinement: We should start by doing the best we can with what we have now. If we can get better information from the people at SOI, then one of the iterative improvements would be to incorporate that information when available.
You don't say how you would translate Euclidean distances into probabilities. Any way you do it, if you assign taxpayers to the state most like themselves, the variation within a state will be attenuated.
Yes, that's true of any approach that assigns records to states based upon probability, rather than distributing portions of records to states (the latter allowing portions of low-probability records to be distributed to a state, the former not), isn't it, including the approach you outlined? My intent, however we define probabilities or distances, would be to distribute portions of records to states, rather than to uniquely assign records to a state, to avoid this problem.
I guess my response is that I expect the best way to distribute the weights will be given by MaxEnt, unless some additional information is available.
dan
I am moving Dan's comments on this point here. Here is his initial comment (https://github.com/open-source-economics/taxdata/issues/138#issuecomment-353685059).
It is possible to assign a single state to each record in an unbiased manner. The way I have done this is to calculate a probability of a record being in each of the 50 states, and assign it to one of those states in proportion to those probabilities. That is, if a record has high state income tax the procedure will show high probabilities for New York, California, etc., and low probabilities (but not zero) for Florida and Texas. Then the computer will select New York or California with high probability and Florida or Texas with low probability. In expectation the resulting totals will be the same as the "long" format but with some unbiased error. I have done this and find that state-level aggregates match nearly as well as summing over all possible states. If desired, one could take 2 draws, or any other number. It would not be necessary to multiply the workload by 51.
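(A quick simulation sketch of that claim, with fabricated probabilities and amounts; it only shows that single-draw state totals track the probability-weighted "long" totals up to sampling noise.)

```python
import numpy as np

rng = np.random.default_rng(5)
n_records, n_states = 100_000, 51

# fabricated per-record state probabilities and a dollar amount to aggregate
p = rng.dirichlet(np.ones(n_states) * 0.5, size=n_records)
wages = rng.lognormal(10, 1, size=n_records)

# "long" totals: every record contributes to every state by its probability
long_totals = (wages[:, None] * p).sum(axis=0)

# single-draw totals: one state per record, drawn in proportion to p
r = rng.uniform(size=(n_records, 1))
state = (np.cumsum(p, axis=1) < r).sum(axis=1)
draw_totals = np.bincount(state, weights=wages, minlength=n_states)

# the two sets of state totals agree up to unbiased sampling error
print(np.max(np.abs(draw_totals / long_totals - 1)))
```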