Determine how to assign weights

All CDF comparisons we've looked at so far conflate two factors:

How well individual records match raw records
How well the weight is calibrated

So far we've been synthesizing weight as any other feature; @donboyd5 has been making it one of the first in the synthpuf sequence, while I've been making it the last in sequential random forests. As the most important feature, it may deserve special treatment.

Per the PUF handbook:

Weights were obtained by dividing the population count of returns in a stratum by the number of sample returns for that stratum. The weights were adjusted to correct for misclassified returns. ... The sample design is a stratified probability sample, in which the population of tax returns is classified into subpopulations, called strata, and a sample is selected independently from each stratum. Strata are defined by:

High combined business and farm total receipts of $50,000,000 or more.

Presence or absence of special Forms or Schedules (Form 2555, Form 1116, Form 1040 Schedule C, and Form 1040 Schedule F).

Total gross positive or negative income. Sixty variables are used to derive positive and negative incomes. These positive and negative classes are deflated using the Gross Domestic Product Implicit Price Deflator to represent a base year of 1991.

Potential usefulness of the return for tax policy modeling. Thirty-two variables are used to determine how useful the return is for tax modeling purposes.

We have a few options:

Continue synthesizing like any other feature.
Emulate SOI strata. I haven't checked whether these fields are in the PUF.
Apply linear programming to hit targets. This differs from current targeting in that it would not minimize deviations from an original weight. Instead it would set an objective function which minimizes some aggregate deviation measure from many targets. This could also be useful for state-level weighting in the future.

There may be others. IMO we should consider separating this problem from the problem of record synthesis, which should be evaluated on record-level similarity.

The third of these approaches is quite interesting but I'm not sure how to operationalize it unless we can define what makes for a good set of weights, beyond satisfying targets, because there will be a huge number of sets of weights that can satisfy targets.

If we can operationalize it, we can start to examine it empirically early on, using data files @MaxGhenis has already created - a problem that is large enough to be interesting, and small enough to keep us from worrying about computer resources. If we work out an analytic approach, we can easily apply it to more-sophisticated synthesized files, and compare to other approaches.

But let's see if we can operationalize it.

Let's assume we have 2 data files:

"actual," which is the true, nonreleasable file (we could use @MaxGhenis's 5% holdout dataset, test.csv, for this)
"syn," which is the synthesized releasable file (we could use the synthpop-synthesized file, synth_synthpop.csv, for this)

In round numbers, both files have 22k observations and 60 variables (including a weight).

We want to compare file-quality measures such as weighted wages or AGI against the values in actual (presumed to be correct), by income range, for 3 approaches to the weights:

use the synthesized weights in syn, which we know have some problems
use the synthesized weights in syn, adjusted as @donboyd5 suggested, obtained by hitting a set of targets while minimizing how far the new weights move from the synthesized weights (implicitly, this assumes that there is something good about the synthesized weights, and moving too far from them is bad)
throw out the synthesized weights in syn, and choose weights that minimize an objective function based on deviations from the targeted values, as @MaxGhenis suggests. I think this is based on the idea that approach 2 might be artificially and unnecessarily anchoring us to weights that simply may not be very good. Maybe we are better off constructing weights from scratch that will achieve all targets and not be anchored to original or synthesized weights.

We know from experience that version 2 virtually always will be easily solvable even with equality constraints for hundreds of targets. That is, there are many - perhaps thousands - of sets of weights that will satisfy the constraints exactly. (We're finding 22k unknown variables that satisfy a few hundred constraints - many different sets of variables can do this.) Approach 2 chooses the set of weights that are closest to the synthesized weights.

Now on to approach 3: any set of weights that hits all of the constraints exactly will minimize an objective function based solely on distance between targets and weighted values on the file, so how do we know which one to choose?

To make it more concrete, suppose we want to satisfy 9 targets constructed from the actual file:

number of weighted returns in AGI range $0 - $50k
number of weighted returns in AGI range >$50k - $100k
number of weighted returns in AGI range > $100k 4-6. total weighted wages in each of those ranges 7-9. total weighted itemized deductions in each of those ranges

In approach 2, we would choose 22k new weights that satisfy these 9 constraints and also minimize an objective function that penalizes change from synthesized weights. One simple objective function we could use to decide which constraint-satisfying set of weights is best is minimizing the sum over 22k records of the squared difference from 1 of the ratio of new weights to synthesized weights. Formally, this is:

objective = sum(x[i] - 1)^2

where each x[i] is the ratio of the new weight to the synthesized weight on a record.

There would be 9 constraints, as defined above.

This is a nonlinear program rather than an LP but the idea is essentially the same as what @MaxGhenis discussed. In practice we might use something slightly more complex.

In approach 3, what would we do? We might set the objective function up as the sum of squared differences between each constraint's calculated value and its target value -- something like:

choose new weights w to minimize the sum of the following 9 squared differences, where each i indexes into the 22k records:

objective = {sum(w[i] (AGI[i] in 0-$50k)) - target1}^2 + {sum(w[i] (AGI[i] in >$50k - $100k)) - target2}^2 + {sum(w[i] (AGI[i] in >$100k)) - target3}^2 + {sum(w[i] wages[i] (AGI[i] in 0-$50k)) - target4}^2 + {sum(w[i] wages[i] (AGI[i] in >$50k - $100k)) - target5}^2 + {sum(w[i] wages[i] (AGI[i] in >$100k)) - target6}^2 + {sum(w[i] itemded[i] (AGI[i] in 0-$50k)) - target7}^2 + {sum(w[i] itemded[i] (AGI[i] in >$50k - $100k)) - target8}^2 + {sum(w[i] itemded[i] * (AGI[i] in >$100k)) - target9}^2

(There are obviously scale issues in defining this objective function. We might scale the calculations so that each constraint is in [0, 1], or in some other way, but that's a next step after we figure out how to set up the problem.)

As I understand how @MaxGhenis put this forward, there wouldn't be any constraints - in approach 3, we would just minimize this objective function.

Obviously the objective function is minimized when all constraints are exactly satisfied, and so the optimal solution to approach 2 (based on its definition of optimality) would also minimize the objective function in approach 3. But many other sets of new weights would satisfy it, too - for example, any of the constraint-satisfying solutions that the NLP solver may have iterated through before it found the solution that minimized the objective function in approach 2. If the goal is to select weights that are better than those in approach 2 in some way, then either we need to add some sorts of constraints (I am not sure what) or somehow add a measure of "good" weights to the objective function. Is there some better definition of a good weight, other than one that is close to the synthesized weight? Would we rather have equal weights for all records, or something else? Even if we were to have hundreds of targets rather than 9 (and we probably would), we would almost certainly end up in this situation.

@MaxGhenis, can you elaborate on how we would implement this approach? I think we would need to somehow define what good weights would be, and incorporate them in the objective function. (And after doing that, it is not clear to me why it would be better to have the constraints incorporated into the objective than set out separately as constraints - the former requires us to deal with scale issues and possibly assign relative importance to constraints, and the latter does not.)

Thanks for formalizing this @donboyd5, your example objective function is exactly what I had in mind. As you say, we should also rescale, and probably weight the targets subjectively if we care about something like AGI more than something like # people under age 13. There could also be loose constraints like ensuring positive weights for all records, and that each individual target doesn't deviate too far (though including squared deviations in the objective function should get us far here).

It'd be great to get to the point where we have to choose among multiple sets of weights which each satisfy targets perfectly. I don't think we'll be able to get there unless we either synthesize many more records than the original PUF, and/or we haven't included as many targets as we should. I'd expect the objective function to include hundreds if not thousands of targets, including counts within crosstabs, averages, quantiles, etc., so it'll be hard to hit all of them well (just as the original PUF misses on some targets).

BTW ideally the targets in approach 3 are the same as those defined for other parts of the problem, like those from approach 2. We're really just reshaping the problem from constraints to an objective function.

This ties to @feenberg's concern raised yesterday that by weighting records to hit an objective function and then evaluating the records on the same objective function is unfair. This could justify a couple modifications:

Using a third holdout set for comparing methods. This is common in prediction problems, for example Kaggle's data science competitions often provide a training and validation set (used to tune a model), and then choose the ultimate winner based on a third test set. So if we have competing weighting schemes we could choose based on that third holdout.
Adding a regularization penalty to avoid overfitting. For example in https://github.com/open-source-economics/taxdata/issues/138#issuecomment-352869297 regarding state-level weights, I suggested an L1 penalty as used in LASSO regressions. The L1 penalty equals the sum of absolute value of the magnitude of coefficients to the objective function. This turns out to shrink coefficients on the least useful features to zero, and has been shown to avoid overfitting. I'd have to give more thought to how this would analogize to our case, but ideally we'd do something to both avoid overfitting and discard least-useful records, e.g. those that are very similar to other records. For example, rather than synthesize 100k records, we could synthesize 200k and throw out the 100k least valuable ones, and end up with a better fit.

Here are some findings from @donboyd5 with respect to the initial test synthesis file, which synthesized s006 as a non-seed variable:

The first table below looks at the 3 10% sample files. The columns are # of records, sum of s006 (didn't bother to divide by 100), and sum of wages (unweighted). Obviously the sum of the weight comes in well below either the training or test amount. I checked in Excel to make sure I didn't have some odd error reading the file. The unweighted sum of wages also is quite far from test and train but of course it's very early in the process. The second table repeats this for the full puf and synthesized version, to make sure it is not an artifact of the sample.

The 3rd and 4th tables show quantiles of s006 in the 10% sample and full files, respectively. The extremes are not far off, but the middles are.

I am going to guess this is related to the sequence of fitting and synthesis. In a future run, it might be worth forcing s006 to be an X variable (carried over to the synthesized file as is) or making it one of your randomly sampled seed variables.

@feenberg also found that "the mean e00200 was $186,979."

I'll create another file with s006 as a seed variable as a first step, though this could also be due to other model design issues like too few seed columns, too few trees, etc.

On Tue, 11 Dec 2018, Max Ghenis wrote:

Here are some findings from @donboyd5 with respect to the initial test synthesis file, which synthesized s006 as a non-seed variable:

I have a program that scores 35 or so plausible tax reforms with the PUF and another file. If the alternate file is just the PUF rounded to 2 digits, the scores are very close. I'd like to try the synth file again, but the first draft gave scores that were not good. I'll try again with the next version.

Dan

That's great, Dan. It will be great to see it after you've got it to your satisfaction.

Don

On Wed, Dec 12, 2018 at 7:12 AM Daniel Feenberg notifications@github.com wrote:

On Tue, 11 Dec 2018, Max Ghenis wrote:

Here are some findings from @donboyd5 with respect to the initial test synthesis file, which synthesized s006 as a non-seed variable:

I have a program that scores 35 or so plausible tax reforms with the PUF and another file. If the alternate file is just the PUF rounded to 2 digits, the scores are very close. I'd like to try the synth file again, but the first draft gave scores that were not good. I'll try again with the next version.

Dan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/donboyd5/synpuf/issues/9#issuecomment-446566837, or mute the thread https://github.com/notifications/unsubscribe-auth/AGPEmFY-PIaXXmvNiWi5PVEdE75ndiFDks5u4PLDgaJpZM4Yqi5k .

donboyd5 / synpuf

Determine how to assign weights #9