PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Adding state identification to the PUF #138

Open MattHJensen opened 6 years ago

MattHJensen commented 6 years ago

One of the most frequent requests that I have heard from users, especially during the TCJA debate, has been for the capability to analyze the impact of federal tax reform by state.

Don Boyd (@donboyd5) and I have been discussing an approach to do this, and he has given me permission to move our conversation onto GitHub. I will reproduce the conversation to date in the next comment.

Another advantage of the approach that we discuss is that it would provide a PUF-based dataset that could support state-level calculators.

MattHJensen commented 6 years ago

From @donboyd5

I'll try to pull together my thoughts, notes, and programs on the state-assignment issue. Let me give a few overview thoughts now.

First, I use the term state distribution rather than state assignment to reflect the fact that a return might be distributed to several states, partially (e.g., 20% to this state, 80% to that state), rather than assigned entirely to one state.

There are 2 approaches I can think of: (1) the one I have been using - distribute selected returns to all states, and (2) the one the Urban Institute has been using - distribute all returns to all states. I don't have an a priori opinion on which is better, but theirs does create a large file and I like parsimony.

1) My approach - distribute selected returns to all states (let's say 50 states and 70k returns). Two major steps (many substeps):

Step a) Distribute PUF un-state-coded returns to the 50 states.

a.i) Choose the proportion of each return to be distributed to each state, such that:

a.ii) Each such proportion is in [0, 1].

a.iii) The sum of proportions for every return is 1 (i.e., there are 70k adding-up constraints).

a.iv) The results hit known/estimated targets for each state for maybe 20 variables (e.g., # of returns with cap gains, total dollars of gains, # of returns with SALT deduction, total SALT deduction) based on published SOI aggregates. These are constraints to be satisfied. There might be 1,000 such constraints (20 constraints per state, 50 states).

a.v) The proportions are chosen to optimize some objective function intended to minimize "distortion" from these distributions or that has other nice qualities. (Dan F. likes entropy maximization. I like something else. I have done it both ways. It is worth discussing, but it is a single technical choice within a much larger methodology.)

This is therefore a nonlinear program (NLP). It is rather large. I gave you 2 different dimensions on the phone. The first was correct, the 2nd wasn't. It has about 3.5 million variables (one proportion for each of 70k returns x 50 states) and 71k constraints. It can use a lot of memory but there are software solutions to that problem. It can be hard to solve. But it is solvable.
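For reference, the nonlinear program described above can be stated compactly as follows (the notation is mine, not Don's: p[i,s] is the share of return i distributed to state s, w[i] its PUF weight, v[i,k] the value of targeted item k on return i -- a dollar amount, or a 0/1 indicator for count targets -- and T[s,k] the SOI target for state s and item k):

```latex
\min_{p}\; \sum_{i,s} p_{is}\,\ln p_{is}
\qquad \text{(entropy-style objective; other distortion penalties are possible)}

\text{subject to} \qquad 0 \le p_{is} \le 1,
\qquad \sum_{s} p_{is} = 1 \;\; \forall i
\qquad \text{(about 70k adding-up constraints)}

\sum_{i} p_{is}\, w_{i}\, v_{ik} = T_{sk} \;\; \forall s, k
\qquad \text{(about 1,000 target constraints, possibly within tolerances)}
```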

Step b) Supplement the already-state-coded returns (the other maybe 90k returns - I don't remember how many). Turns out you can't really hit SOI aggregates for the 50 states for some variables based on the already-state-coded returns. What I did is beef up the number of state-coded returns for each state and income range by pulling in copies of similar returns from other states, based on a distance function. Then I adjusted weights to hit known SOI values by state and income range, using an NLP approach similar to that in Step a. There are a lot of important details involved in doing this in a workable way.

2) Urban Institute approach: For each state, they reweight the PUF so that the weighted returns hit known targets for each income range, without paying any attention to state codes that are on the file. I think they use an LP or NLP approach to reweighting. Assuming the numbers above (160k PUF returns), this results in 160k returns per state, or 8 million returns - many of which have zero or near-zero weight. Personally, I would weed out near-zero returns and reweight again.

I don't know which approach is better. Mine says, in Step b, that there is useful information in the available state coding and we should use it, supplemented. Theirs says there is no value in state codes, and we don't care how much we use any one return.

I think the two approaches are worth some debate, and some analytic comparisons.

There are many other important issues as you forecast from a known year to the unknown future (e.g., what to do about reported pass-through income in Kansas?).

From me:

Thank you — I am also quite excited. State distribution would satisfy a large fraction of our user requests that currently go unsatisfied.

We will be working with PUFs from 2009 on, none of which have state codes for any records. Is it right to think that the Urban Institute approach will be our only option?

From Don:

My initial reaction is yes, as long as you don't want to impose any constraints such as requiring that returns used to represent one state cannot also be used to represent another state. Under these circumstances (no state codes at all), I don't think I see a reason to impose such a constraint.

My own sense of tidiness would make me want to have a smaller file than Urban has, rather than using all returns for each state. You'll probably end up with a lot of records for a given state that represent a tiny fraction of a person. It might be worth some intellectual and experimentation effort to investigate ways to drop a lot of these super-small returns.

All of this is worth discussion.

I think if you don't have any state codes, and you go in this direction, the problem should become quite easy computationally. It might go something like this (let's say there are 150k records in the PUF):

For each state X:

  1. Start with all 150k PUF records

  2. Scale their weights proportionately so that the weights add to the number of returns in the state as reported in SOI summary data (e.g., if California is, say, 1/6 of the number of returns in the U.S., then divide each record weight by 6). Possibly do this for subgroups for which we have SOI summary data on # of returns - for example, for each of 5 income ranges by 2 marital classes (married-joint and other). Thus, the scaled file would have the right number of weighted total returns in each of, say, 10 mutually exclusive categories. (The weighted totals for income, deductions, etc., would be way off.)

  3. Choose 150k record-specific scaling factors, to be multiplied by each record weight, such that when weights are adjusted, we hit known SOI targets for the state. The targets might be items such as, for each of 5 income ranges: num married returns, num other returns (already correct at this stage); total agi, total wages, total cap gains, total itemized deductions, total SALT deduction; num returns with wages, num returns with cap gains, num itemizers, num SALT takers; and so on - maybe 20 or 30 of these.

Call the factors we are choosing x[i], where i runs from 1 to 150k. Choose these x[i] such that each x[i] >= 0 and each x[i] <= some upper bound, such as 10 (150k bound constraints).

Establish constraints that ensure the targets are hit (within tolerances). For example, if there is $15 billion of agi in the $50-75k agi range in state X, then the constraint looks like:

lefthand side:

sum over i: x[i] * w[i] * agi[i] * {agi[i] >= 50e3 & agi[i] < 75e3 -- TRUE/FALSE}

and the right hand side is:

15e9

We choose the x's to satisfy these constraints and bounds, in a way that optimizes some penalty function. This is tricky because we don't really have a priori knowledge of what the x's should be. One natural preference is for the x's to be near 1 -- so that the adjusted weights stay near the weights we created in step 2.

If so, then we might penalize differences from 1. One objective function (to be minimized) would be:

sum over i: (x[i] - 1)^2

but others could be of interest. (As noted, Dan Feenberg likes entropy maximization.)

There is some art in this step because some targets could be hard to hit.

  4. After doing steps 1-3, we'd have a reweighted file with 150k records that hits the targets for state X. But a lot of the new weights would be VERY small (e.g., when doing this for Mississippi, records with high SALT deductions might end up with very low weights). It might be worth considering reasonable ways to pare down the number of records - possibly dropping a lot of records and repeating steps 1-3. My guess is that you could often get down to much smaller numbers of records for each state. Might be worthwhile.
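To make steps 2-3 concrete, here is a minimal sketch on tiny made-up data, using scipy's general-purpose SLSQP solver in place of the specialized NLP software discussed later in the thread; every number, group definition, and target below is invented.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy "PUF": national weights, agi, wages, and a married flag (all made up).
n = 200
wt_us = rng.uniform(50, 150, n)          # national record weights
agi = rng.lognormal(10.5, 0.8, n)        # adjusted gross income
wages = agi * rng.uniform(0.3, 0.9, n)   # wage income
married = rng.random(n) < 0.5            # marital-status group

# Step 2: scale weights within each group (here: married vs. other) so the
# weighted return counts match the state's assumed SOI counts for those groups.
state_counts = {True: 1200.0, False: 900.0}       # assumed SOI return counts for state X
wt_init = wt_us.copy()
for grp, target_n in state_counts.items():
    mask = married == grp
    wt_init[mask] *= target_n / wt_us[mask].sum()

# Step 3: choose record-specific factors x[i] so weighted totals hit targets,
# staying as close to 1 as possible (quadratic penalty), with 0 <= x[i] <= 10.
targets = {"agi": 1.05 * (wt_init * agi).sum(),     # assumed SOI dollar targets
           "wages": 0.95 * (wt_init * wages).sum()}

def objective(x):
    return np.sum(wt_init * (x - 1.0) ** 2)

constraints = [
    {"type": "eq", "fun": lambda x, v=agi, t=targets["agi"]: (wt_init * x * v).sum() - t},
    {"type": "eq", "fun": lambda x, v=wages, t=targets["wages"]: (wt_init * x * v).sum() - t},
]
res = minimize(objective, np.ones(n), method="SLSQP",
               bounds=[(0.0, 10.0)] * n, constraints=constraints)

wt_state = wt_init * res.x                # final state weights for state X
print(res.success, (wt_state * agi).sum() / targets["agi"])
```

At PUF scale (150k factors and dozens of constraints per state) a solver such as Ipopt, as recommended later in the thread, would be more appropriate, but the structure of the problem is the same.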

All of this is worth debate.

From me:

Could you verify that my understanding is correct?

  • we will be adding ~51 columns to the database representing the state weights (or, alternatively, they could contain scaling factors to be applied to the standard weights to obtain state weights)
  • If we don't 'drop' the records for a given state that represent a tiny fraction of a person, then all records will have a non-zero value in each of those ~51 cols.
  • If we do 'drop' the records for a given state that represent a tiny fraction of a person, then some records will have zeroes in one or more of those ~51 cols.
  • One advantage of dropping the records is that tabbing federal revenues by state might be somewhat faster to perform.
  • A more significant advantage of dropping the records is that state calculators could, potentially, be significantly more efficient as they could ignore the 0 weight rows.

From Don:

we will be adding ~51 columns to the database representing the state weights (or, alternatively, they could contain scaling factors to be applied to the standard weights to obtain state weights)

Yes, if done the way Urban Institute does it. I think of this as a "wide" file.

An alternative is to have a "long" file where there are 51 times as many records -- one set of records for each state, with different weights or adjustment factors for each state. The potential advantage of this is that you could then change not just the weights but the actual record values (e.g., agi or the SALT deduction or whatever). Not sure this would ever be needed, but worth discussion. For example, what do we do if we are projecting data for Kansas forward from the latest PUF year, knowing that its state income tax law encouraged people to recharacterize what might have been wage income as, instead, pass-through income? If we had appropriate targets, we might find that the only sensible way to hit them was not just to adjust weights, but even (perhaps) to adjust values so that some records have more pass-through income and less wages. Maybe. I don't know. But at some point, a good topic of discussion.

And more generally, as you move from: a) the first stage, where you are adjusting the PUF to hit known SOI aggregates from the PUF year, to b) the second stage, where you are projecting records to a future year, with no known targets (but with targets you may estimate), you may want the flexibility to adjust not just the weights on records, but also the values. In that case you may no longer have (or want to have) identical records for every state, with different weights; you might want different values, too. At this point, it probably makes sense to move from the wide (51 columns) format to the long format (51 times as many records, each with a state code).
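A small illustration of the two layouts (the column names and values here are hypothetical, not taxdata's actual schema, except that E00200 is the PUF-style wage variable):

```python
import pandas as pd

# "Wide" layout: one row per PUF record, one state-weight column per state.
wide = pd.DataFrame({
    "RECID": [1, 2],
    "E00200": [52000, 310000],          # wages (PUF-style variable name)
    "WT_CA": [112.4, 3.1],              # hypothetical state-weight columns
    "WT_NY": [38.9, 41.7],
})

# "Long" layout: one row per record per state, so values can differ by state.
long = wide.melt(id_vars=["RECID", "E00200"],
                 value_vars=["WT_CA", "WT_NY"],
                 var_name="STATE", value_name="WT")
long["STATE"] = long["STATE"].str.replace("WT_", "", regex=False)
# In the long form, E00200 (or any other value) could later be adjusted
# state-by-state, which the wide form cannot represent.
print(long)
```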

If we don't 'drop' the records for a given state that represent a tiny fraction of a person, then all records will have a non-zero value in each of those ~51 cols.

Not necessarily. Depends on optimization choices: (a) bounds on the weight adjustment factors, and (b) the objective function.

Call the weight adjustment factors x[i, j], where i in 1:150k (if there are 150k records in the PUF now) and j in 1:51 (if there are 51 "states"). Then, if:

a) the variable bounds are set so that x[i, j] in [0, large number] for all i, j, (i.e., a weight adjustment factor is allowed to be zero), and

b) the objective function does not penalize x[i, j]==0 excessively, then some (or even a lot, depending on the objective function) of the x[i, j] could be zero. With Dan Feenberg's objective function, all x[i, j] will be nonzero. (In his function, when distributing returns across states, we are minimizing the sum over i, j of x[i, j]*ln(x[i, j]), and since we can't take the natural log of zero, we can't have x[i, j]==0 for any i, j.) But with other objective functions, a lot of the x[i, j]'s could be zero.
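Written out, the two objective functions being contrasted (again my notation, with x[i, j] the weight-adjustment factor for record i in state j, minimized subject to the same target constraints and bounds):

```latex
\text{entropy-style:} \qquad \min_{x}\; \sum_{i,j} x_{ij}\,\ln x_{ij}
\qquad \text{(requires every } x_{ij} > 0\text{)}

\text{quadratic penalty:} \qquad \min_{x}\; \sum_{i,j} \left(x_{ij} - 1\right)^{2}
\qquad \text{(permits } x_{ij} = 0\text{)}
```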

If we do 'drop' the records for a given state that represent a tiny fraction of a person, then some records will have zeroes in one or more of those ~51 cols.

Yes, by definition.

One advantage of dropping the records is that tabbing federal revenues by state might be somewhat faster to perform.

Yes, although I don't think computation cost will matter much for most kinds of income tax models.

A more significant advantage of dropping the records is that state calculators could, potentially, be significantly more efficient as they could ignore the 0 weight rows.

Yes, but again, computation cost should not be a big deal.

I think the real questions are tidiness and communication:

  • Tidiness: Does it bother us that we have a lot of records - perhaps tens of thousands - that are tiny fractions of a person and that in aggregate may not even add up to a single person? Do we want to say that this tax overhaul option will result in 0.005 people who are losers, or whatever? I am not sure. For some reason it bothers me, but maybe it shouldn't.

  • Communication: Even if it does not bother us, will it be hard to explain it to some audiences, and to gain their confidence in the results? If so, we could just not make a big deal of it. Or we could consider whether we could have equally valid results from a more parsimonious sample (i.e., with some zero-weight records, which are then dropped).

From me:

These answers are very helpful.

  • I am convinced by the value of the “long” file approach.

  • I am not so ready to write off computation costs as a consideration since some of our users’ applications call Tax-Calculator many times. (see, e.g., this discussion). I’d add efficiency as another benefit in favor of parsimony!

Do you mind if I move this discussion onto GitHub so that others can participate? I would just need a GitHub handle for you.

From Don:

Yes, I am a lover of parsimony and efficiency. I have spent days making computer programs more efficient (I guess we can debate whether spending that much of my time is true efficiency, but I find that efficiency has many benefits, including greater readability and less likelihood of error). And if we find reasons to do stochastic runs - for example, running 51 state models 1,000 times - then efficiency will definitely be computationally important.

Moving to GitHub would be great; my username is donboyd5.

Whew!

MattHJensen commented 6 years ago

Adding a help-wanted label. It would be great for someone to implement this or a similar approach using a 2009 or later public use file. OSPC staff will likely have bandwidth to participate in the implementation in the early new year, but not immediately.

donboyd5 commented 6 years ago

Great. For whoever might be able to help, let me add a few broad comments worth thinking about.

1. In my experience, scaling up these kinds of problems almost always presents unanticipated obstacles and teaches us important lessons. For example, moving from a problem with 1,000 variables and 100 constraints to a problem with 3 million variables and 70 thousand constraints will present obstacles. Often these obstacles are great enough to bring a project to a halt, until they are solved.

The obstacles can include extreme memory usage, extreme computational slowdown, numerical instability, and optimization software that simply chokes.

2. It is almost always better to try to get around these obstacles with more-appropriate software and smarter problem setup than with bigger hardware.

3. Often it is possible to break a single very large and hard-to-solve problem into many smaller independent problems, and it usually is better to do so.

For example, it is possible to "stack" 50 copies of a 160k-record PUF into a single 8-million record file with 50 sets of state codes, and then create, let's say, 20 targets (constraints) for each state and each of 4 income classes, so that we have 20 targets x 50 states x 4 income classes = 4,000 constraints. We could then look for 8 million weight-adjustment factors that satisfy these 4,000 constraints, while minimizing an objective function based upon those 8 million variables, all in one fell swoop. And we probably could solve it.

Alternatively, we could break this into 200 separate problems (50 states, 4 income classes) that are mutually exclusive, each of which has about 40k variables to be found (assuming, for simplicity, that one-fourth of the 160k records falls into each income range). This makes each problem much simpler to solve, and also makes it easy to identify any "problem" groups (state/income-range combinations) that aren't solving well, allowing the analyst to pay extra attention to such groups and to devise workarounds (that can be automated).
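A skeletal illustration of that partitioning in Python (the column names, the state_targets structure, and the solve_group routine are all placeholders, not existing taxdata code; solve_group would wrap an optimization like the one sketched earlier in the thread):

```python
import numpy as np
import pandas as pd

def solve_group(records: pd.DataFrame, targets: dict) -> pd.Series:
    """Placeholder: solve one small NLP (e.g., with scipy or Ipopt) returning
    weight-adjustment factors for a single state/income-class group."""
    # ... optimization over only this group's records ...
    return pd.Series(np.ones(len(records)), index=records.index)

def solve_all(puf: pd.DataFrame, states: list, state_targets: dict) -> pd.DataFrame:
    """Solve 50 states x 4 income classes = 200 independent subproblems."""
    results = []
    for state in states:
        for inc_class, grp in puf.groupby("income_class"):
            x = solve_group(grp, state_targets[(state, inc_class)])
            results.append(pd.DataFrame({"RECID": grp["RECID"].values,
                                         "STATE": state,
                                         "x": x.values}))
    return pd.concat(results, ignore_index=True)
```

Because each subproblem touches only its own records and targets, a group that fails to solve can be inspected and rerun on its own.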

4. Software

a) I have found that the best (in my experience) non-linear solver to address points 1 and 2 is the open-source Ipopt, relying on the linear solvers MA77 and/or MA57 depending on the specific application (these linear solvers are available to academics under a license that allows free use, as long as the terms of the license are followed). MA77 does much of its work out of core (on disk rather than in memory) and so scales very well to extremely large problems; MA57 generally will be faster on smaller problems.

b) Addressing point #3 requires flexible software that allows very efficient problem setup (because of the thousands of constraints -- too many to code by hand), that allows easy iteration through many problems (e.g., the 200 problems described in item 3), and that can call Ipopt and its linear solvers. I have used R plus the ipoptr package (which is a pain to compile), but I know that Python and Julia are both capable (and Julia might be faster); my familiarity with Python and Julia is limited, but it does not look hard to do this with one of these languages rather than R.

I have found this combination to work extremely well on a moderately sized Windows PC (AMD, 8 cores, 32 GB RAM; I suspect it could work with 16 GB). It should be easy to port the problem to Linux.

If someone is able to help, I can share my (not necessarily well-organized) work.

Don


ernietedeschi commented 6 years ago

Thanks Don. Just my two cents here:

  1. Since the point of statistical matching is to be unbiased rather than precise (which is impossible here), Urban's general approach of ignoring state-coded records is fine provided the underlying methodology is sound. You probably know better whether your proposed approach or the Urban approach makes a meaningful difference in terms of processing time.

  2. I agree that it's advisable to allow PUF records to map across multiple states. I also agree that zero-weighted records make no sense but, speaking personally, I'm OK with small-weighted records. They're the output of an unbiased process that's targeting administrative totals, so within reason -- i.e. unless the result is a massively-unwieldy file -- I'd prefer robustness over speed / tidiness.

I've done a tiny bit in this area attempting to match the PUF to the CPS, the ACS, and/or IRS summary files. I haven't tried the nonlinear methods you propose but happy to help out to the extent I can.

MaxGhenis commented 6 years ago

In addition to parsimony, is another rationale for nixing small weights a concern about overfitting? If so, LASSO regression comes to mind, which reduces the risk of overfitting by adding an optimization penalty on the sum of the absolute values of the coefficients, driving many of them to exactly zero. Could there be an analog here?

For example, prediction errors on each state's objective function could be calculated via cross-validation, and this may reveal a benefit to penalizing or capping nonzero record-state weights.
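One way the analogy might be made concrete (my sketch, not something proposed in the thread): add an L1-style term to the reweighting objective for a state, so that a larger penalty weight lambda pushes more record-state adjustment factors to exactly zero, much as LASSO zeroes out regression coefficients:

```latex
\min_{x \ge 0}\; \sum_{i} \left(x_{i} - 1\right)^{2} \;+\; \lambda \sum_{i} x_{i}
\qquad \text{subject to the state's target constraints}
```

Because the factors are bounded below by zero, the L1 term is simply the sum of the x's; lambda could then be tuned by the cross-validation against held-out targets suggested above.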

donboyd5 commented 6 years ago

Thanks, Ernie.

  1. I think we probably won't get to explore the pros and cons of using state-codes on existing records vs. not using them. Matt tells me that the PUFs available now do not have the state codes (I have used older PUFs that have them), so as a practical matter we must use uncoded records. (I hate to throw away information so if we had codes, I'd want to try hard to use them. But we don't have to join that debate.)

  2. Great. I'm agnostic on this, but I think it's something that we can work through as we work with actual data.

Don

As an aside, I use terminology slightly differently than you used in your note, as follows. There are two important tasks that microdata users often need/want to do:

a) "Target" a microdata file, typically by adjusting weights, so that aggregated weighted values are more-consistent with what we think the world looks like. That's what I'm talking about here: adjusting weights on individual records in a way that ensures that the aggregated reweighted file produces totals close to what we think is true. That involves choosing adjustment factors that satisfy constraints. There are many different sets of adjustments that might satisfy the constraints, so you need some method of choosing the best set of adjustments. That's where optimization comes in. Often we choose adjustments that minimize some measure of distortion, or a penalty measure

b) "Statistically matching" two different microdata files that purport to be (or, with enhancement, purport to be) from the same universe, but come from different samples. For example, we may have a sample of tax filers (i.e., the PUF), and a sample of consuming households (e.g., the Consumer Expenditure Survey) that, with some enhancement, might both be from the same universe (for example if we added nonfilers to the PUF, so that both purport to be describing the U.S. population). Perhaps we want a file with tax data (PUF-plus) and consumption expenditure data (CEX). But we can't link these files directly because they clearly are from different samples and no hard match is possible. One approach is to impute consumption expenditures to the tax records using regression-based approaches or something similar. However, it might be hard to have the right correlations among variables, and the totals (weighted consumption on the tax file) might be far from the known CEX totals. Another approach might be to "match" the two files "statistically" (although I think "statistical" is a misnomer) -- forcing all of the records in an adjusted CEX to map to all of the records to the adjusted PUF so that some fraction of each CEX record's weight is matched with 1 or more PUF records, so that all CEX and PUF record weights are exhausted. As a result, totals in the PUF-CEX matched file will equal the separate income totals in the PUF file and consumer expenditure totals in the CEX file. It can be structured as a giant transportation problem, or minimum cost network flow problem. (It is giant because of the number of possible matches. If you had 100k PUF records and 50k CEX records, then there are 5 billion possible matches. There are ways to reduce the problem size by defining acceptable matches as a subset of possible matches - don't match millionaires with low-income records, for example.) This approach has plenty of problems, but also can have value.

In any event, in my comments that were posted, I've been referring to what I call targeting ("a") and not what some people call statistical matching ("b").


donboyd5 commented 6 years ago

I am sorry to say I don't know enough about LASSO regression to have an intelligent thought on this but am interested in seeing others' responses. Happy to learn more.

Don


ernietedeschi commented 6 years ago

To clarify — and pardon my ignorance in all of this — how is the initial state assignment done? Randomly?

Also, in this approach, how does the algorithm decide whether or not to split records across states? It seems like with continuous weights there’s an infinite ability to split records.

donboyd5 commented 6 years ago

Not at all an ignorant question. Please don't take what I describe below as THE way to do it. It is A way to do it that makes sense to me, subject to criticism, discussion, etc.

Let's take the simple case we are dealing with now (I believe), where we want to take the (let's call it) 150k PUF returns and give them weights that make them represent a specific state - let's say, California.

(We're not trying to do anything fancy such as constraining the weights across states so that each record's weight is used exactly - no more and no less. To be clear about what we're NOT doing in this illustration: we are not taking, for example, record number 1,017, which has a weight of 100, and distributing that weight in such a way that it is forced to be used up exactly, such as giving a weight of 33 to California, 27 to New Jersey, and 40 to New York. Instead, in what I describe below, we can use as much or as little of record number 1,017 as we want.)

So, we have 150k returns and we want to weight them in a way that they represent what we know about California. We could do the following (all numbers are made up). Stage 1: scale the US weights, group by group (income range by marital status), by the ratio of California's SOI return count for the group to the weighted US return count for that group.

For example, if the weighted number of US PUF married-joint returns in the $50-75k agi group is 50 million, and if the SOI summary data show that California had 4 million federal married-joint returns in the $50-75k agi range, then the ratio for this group would be 4/50 or .08. We would multiply the weight of every married joint record in the US PUF in the $50-75k agi range by 0.08.

The result of this step would be a file with 150k returns that, when aggregated using the new weights, would have the right number of CA federal returns in each income range and marital status grouping (the 10 groups). Of course, income, deductions, and other items would be wrong. Income would be too low, SALT deductions certainly would be too low, and so on. But it is a starting point.

We then ask the question, how can we adjust our initial weights so that we hit all 150 targets, and what is the best way to do so? The how part requires solving a set of equations for each income range j such as the following:

sum over i: wt.init[i] * x[i] * capgains[i] * {agi is in income range j -- TRUE/FALSE} = target $ capital gains in income range j

sum over i: wt.init[i] * x[i] * (capgains[i] > 0 -- TRUE/FALSE) * {agi is in income range j -- TRUE/FALSE} = target # of capital gains returns in income range j

and so on for all 30 constraints for income range j,

where wt.init[i] is the initial weight for return i from stage 1 and x[i] is the adjustment factor for return i's weight.
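In code, the left-hand sides of those two kinds of constraints (a dollar target and a return-count target for income range j) might look like the following; the arrays are tiny placeholders, one element per record:

```python
import numpy as np

# Placeholder arrays: initial weights, adjustment factors being solved for,
# capital gains, and agi.
wt_init = np.array([100.0, 80.0, 120.0])
x = np.array([1.0, 1.0, 1.0])
capgains = np.array([0.0, 5000.0, 20000.0])
agi = np.array([55000.0, 60000.0, 72000.0])

in_range_j = (agi >= 50e3) & (agi < 75e3)          # income-range-j indicator

lhs_dollars = np.sum(wt_init * x * capgains * in_range_j)        # target: $ of cap gains
lhs_returns = np.sum(wt_init * x * (capgains > 0) * in_range_j)  # target: # of cap-gains returns
```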

In other words, we are solving for 150k x[i] values (when we consider all income ranges) that will ensure that our targets will be hit. In general, there won't be a unique set of 150k x[i] values -- there is more than one way to accomplish this -- so we have to have some way of choosing the best set of 150k x[i] values. This is where optimization comes in. We need to choose an objective function of the x[i]'s that we will then optimize.

IF we think that the initial stage 1 weights are a pretty good first cut, and don't want to deviate too much from them, we could penalize differences from them. In that case, x[i]==1 would entail no deviation from the initial weights, and the farther x[i] is from 1, the greater the deviation. One such penalty function (of many possible penalty functions) might be:

minimize sum over i: (x[i] - 1)^2 * wt.init[i]

Here, the further x[i] is from 1, the greater the penalty, with the penalty increasing with the square of the distance.

Thus, we would choose 150k x[i] values in a way that minimizes this penalty function, while still hitting the constraints/targets described above (150 for a given state). We would also impose bounds, such as the requirement that for each x[i] we must have 0 <= x[i] <= some large number (e.g., 10).

Because this objective function prefers x[i] to be 1 or 0, the solution is likely to have many records where x[i] is zero. These records could then be dropped.

We'd then repeat for each state, with no constraints on how many times we can use a record.

Other people might say this is not a good penalty function - we want an objective function that doesn't assume that the initial values from stage 1 are meaningful. That would be a good conversation to have.

Anyway, that's the general idea.

Don


ernietedeschi commented 6 years ago

Ah OK, the strategy I was missing was iterating over each state one by one and opening up the whole PUF sample for each. That makes sense, thanks for the clear explanation.

feenberg commented 6 years ago

It is possible to assign a single state to each record in an unbiased manner. The way I have done this is to calculate a probability of a record being in each of the 50 states, and assign it to one of those states in proportion to those probabilities. That is, if a record has high state income tax, the procedure will show high probabilities for New York, California, etc., and low probabilities (but not zero) for Florida and Texas. Then the computer will select New York or California with high probability and Florida or Texas with low probability. In expectation the resulting totals will be the same as in the "long" format, but with some unbiased error. I have done this and find that state-level aggregates match nearly as well as summing over all possible states. If desired, one could take 2 draws, or any other number. It would not be necessary to multiply the workload by 51.

ernietedeschi commented 6 years ago

Thanks Daniel. Do you have code for this procedure you’ve used that you’d be willing to share with us?

feenberg commented 6 years ago

I only have Stata code. The variables below (cumulative, onestate, etc.) are vectors with an element for each record. p1 through p51 are the estimated probabilities of a record being in each state. r is a random value uniform on 0-1.

gen cumulative = 0
gen byte onestate = 1
quietly {
    forvalues state = 1/51 {
        replace cumulative = cumulative + p`state'
        replace cum`state' = cumulative
        replace onestate = `state' + 1 if cum`state' < r
    }
}

quietly {
    forvalues state = 1/52 {
        foreach v in `vars' {
            generate p`v'`state' = `v' * p`state'
        }
    }
}

It should be straightforward to do this in R or any language.
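For example, a rough numpy sketch of the same idea -- one uniform draw per record, compared against the cumulative state probabilities -- with a made-up probability matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# probs[i, j]: estimated probability that record i belongs to state j (made up here);
# each row sums to 1. Real probabilities would come from a model of state residence.
n_records, n_states = 5, 51
probs = rng.dirichlet(np.ones(n_states), size=n_records)

# One uniform draw per record; pick the first state whose cumulative probability
# exceeds the draw -- the same cumulative-sum logic as the Stata snippet above.
cumprobs = np.cumsum(probs, axis=1)
r = rng.random((n_records, 1))
state_index = (cumprobs < r).sum(axis=1)   # 0-based index of the drawn state

print(state_index)
```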

dan

