donboyd5 / synpuf

Synthetic PUF
MIT License

Selected results after reweighting #30

Open donboyd5 opened 5 years ago

donboyd5 commented 5 years ago

I put together an R program that will reweight any full-PUF-synthesis to hit a large number of targets.

In the initial run, I reweighted synthpop3 using 63 targets: 11 income ranges x 6 variables, less 3 targets that were not feasible because all records in the income range that had the variable of interest had only zero values for the variable (thus, no adjustment to the weights for those records could change the sum of weighted values).

Income ranges:

agi.ranges <- c(
  "c00100 < 0",
  "c00100 == 0",
  "c00100 > 0 & c00100 <= 25e3",
  "c00100 > 25e3 & c00100 <= 50e3",
  "c00100 > 50e3 & c00100 <= 75e3",
  "c00100 > 75e3 & c00100 <= 100e3",
  "c00100 > 100e3 & c00100 <= 200e3",
  "c00100 > 200e3 & c00100 <= 500e3",
  "c00100 > 500e3 & c00100 <= 1e6",
  "c00100 > 1e6 & c00100 <= 10e6",
  "c00100 > 10e6 & c00100 <= Inf")
agi.ranges

Variables: vars.to.target <- c("wt", "c00100", "e00200", "e00300", "e00650", "p23250")
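As a rough illustration of how range-by-variable targets can be tabulated and how infeasible cells can be spotted, here is a minimal sketch. The data frame `syn`, its made-up values, and the shortened range and variable lists are assumptions for illustration only; this is not the actual program.

```r
# Toy data standing in for a synthesized file (all values are made up)
set.seed(123)
n <- 1000
syn <- data.frame(
  wt     = runif(n, 50, 150),
  c00100 = c(rep(0, 50), rnorm(n - 50, 60e3, 80e3)),
  e00200 = pmax(0, rnorm(n, 40e3, 30e3))
)
syn$e00200[syn$c00100 == 0] <- 0  # force an all-zero cell for the demo

# Shortened, hypothetical versions of the range and variable lists
ranges <- c("c00100 < 0", "c00100 == 0",
            "c00100 > 0 & c00100 <= 50e3", "c00100 > 50e3")
vars <- c("wt", "c00100", "e00200")

cells <- expand.grid(range = ranges, var = vars, stringsAsFactors = FALSE)
cells$wtd.sum  <- NA_real_
cells$feasible <- NA
for (i in seq_len(nrow(cells))) {
  # Evaluate the range expression against the columns of syn
  in.range <- eval(parse(text = cells$range[i]), envir = syn)
  # A "wt" target is a weighted record count, so each record contributes 1
  v <- if (cells$var[i] == "wt") rep(1, n) else syn[[cells$var[i]]]
  cells$wtd.sum[i]  <- sum(syn$wt[in.range] * v[in.range])
  # If every value in the cell is zero, no change in weights can move the sum
  cells$feasible[i] <- any(v[in.range] != 0)
}
cells
```

In this toy setup the `c00100 == 0` range is infeasible for both `c00100` and `e00200`, while the record-count target for that range remains feasible.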

Here are a few results for the puf, synthesis-before-reweighting, and synthesis-after-reweighting by income range, plus differences from puf and % differences. They seem fairly heartening to me. Tax before credits is within 0.2% on the bottom line and close in most income ranges, despite not being targeted.

Tomorrow, I will clean up the program, add some more targets, and push to github - it will be a program in the misc section of EvalWtdSyn. I will also put the relevant reweighted file and its counterparts in synpuf, in case @feenberg can run them through the tax reform routine. And I will try to run it through some of the descriptive routines so we can see what kind of unintended side effects reweighting may have on non-targeted values.

I also show below a histogram of the adjustment factors for the weights. BTW, the optimization runs in about 29 seconds.

Weighted number of records by AGI range:

image

Sum of weighted AGI, in $ billions, by AGI range:

image

Sum of weighted wages, in $ billions, by AGI range:

image

Sum of weighted tax before credits, in $ billions, by AGI range:

image

Distribution of the weight adjustment factors:

image

donboyd5 commented 5 years ago

Here are the same 4 tables with taxbc and pensions added to the targeting; 82 targets. I'll try to add targeting by major marital status, too.

image

feenberg commented 5 years ago

Are we worried that some of the weights may have been overly mangled? I recall that is something that Ohara worried about (and restricted).

dan

On Thu, 20 Dec 2018, Don Boyd wrote:

Here are the same 4 tables with taxbc and pensions added to the targeting; 82 targets. I'll try to add targeting by major marital status, too.

image


donboyd5 commented 5 years ago

Always, but maybe less so here than in other applications: we are less sure that the synthesized weights are good starting points to anchor to than we are when adjusting true-PUF weights.

That said, there are two things that guard against weight-mangling:

  1. The penalty function being minimized penalizes large changes in weights.
  2. I put bounds on allowable weight adjustments. In the runs above, I forced the adjustment factor to fall between 0 and 2. 0 is not a meaningful protection, but 2 is (no weight can be made more than twice as large as its original value).

And after the optimization, we look at the distribution of the weight adjustment factors. That's what the histogram above shows. To my eyes, it doesn't look worrisome but I don't have a loss function - don't know what it should look like.
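To make the two safeguards concrete, here is a minimal sketch under stated assumptions. It uses base R's `optim` with `method = "L-BFGS-B"` and a quadratic penalty on missed targets rather than the constrained solver used for the runs above, and the data, the targets, and the penalty weight `lambda` are all made up.

```r
set.seed(1)
n    <- 200
wt   <- runif(n, 50, 150)                        # toy starting weights
vals <- cbind(nrecs = 1, wages = rlnorm(n, meanlog = 10, sdlog = 1))
targets <- colSums(wt * vals) * c(1.03, 0.97)    # hypothetical targets

# A[i, j]: share of target j contributed by record i at its starting weight
A <- sweep(wt * vals, 2, targets, "/")

lambda <- 1e4  # hypothetical weight on missed targets

# Objective: mean squared change in the adjustment factor x
# (new weight / starting weight) plus a heavy penalty on relative misses
obj <- function(x) {
  miss <- colSums(x * A) - 1
  mean((x - 1)^2) + lambda * sum(miss^2)
}
# Analytic gradient of obj
grad <- function(x) {
  miss <- colSums(x * A) - 1
  2 * (x - 1) / n + 2 * lambda * as.vector(A %*% miss)
}

# Box bounds keep every adjustment factor between 0 and 2
res <- optim(rep(1, n), obj, grad, method = "L-BFGS-B", lower = 0, upper = 2)
x <- res$par

pct.miss <- 100 * (colSums(x * wt * vals) / targets - 1)
round(pct.miss, 3)  # % miss for each target
quantile(x)         # distribution of the adjustment factors
```

The `quantile(x)` call plays the role of the histogram: it shows how far the adjustment factors moved from 1 and whether any are piling up at the bounds.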

donboyd5 commented 5 years ago

Updated results

I have reweighted synthpop3 with 425 constraints (targets), as follows:

That yields 462 possible targets (11 income ranges x 3 marital statuses x 14 variables).

Some of these are not feasible. For example, because every record's agi value in the agi==0 range is 0, a target for total agi in this range is meaningless (no matter how you change the weights, total agi in this range will always be zero), although targets for other variables in this range can be meaningful (e.g., we may want to vary weights so that we hit a target for total dividend income in the agi==0 range). So the agi variable by itself creates 3 infeasible targets (1 income range x 3 marital statuses x 1 variable (agi)). 34 other combinations also were infeasible.

That leaves us with 425 feasible targets.
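The target count can be checked with a few lines of arithmetic. The variable names are only for illustration.

```r
n.ranges <- 11   # income ranges
n.mars   <- 3    # marital statuses
n.vars   <- 14   # variables

possible   <- n.ranges * n.mars * n.vars   # 462
infeasible <- 1 * n.mars * 1 + 34          # 3 from agi in the agi==0 range, plus 34 others
feasible   <- possible - infeasible        # 425

c(possible = possible, infeasible = infeasible, feasible = feasible)
```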

A few of the synthetic values were so far from the targets that I had to put broad tolerances around the constraints, but most were not too hard to hit. The optimization routine is designed to hit the targets within the tolerances I set, while minimizing a penalty measure that is based on the ratio of the new weight from the optimization to the synthesized weight. I put bounds on this ratio so that it could not be less than 0.1 nor more than 3.
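One way to express tolerances around targets is as lower and upper constraint bounds. The sketch below is only an assumption about how that might look: the target names, values, and tolerances are hypothetical.

```r
# Hypothetical targets (one negative, as in a negative-AGI cell) and achieved values
targets  <- c(agi.neg = -120, agi.50k = 850, wages.50k = 700)
achieved <- c(agi.neg = -95,  agi.50k = 853, wages.50k = 710)
tol      <- c(0.25, 0.01, 0.01)  # relative tolerances, broad where a target is hard to hit

# abs() keeps the lower bound below the upper bound when a target is negative
clb <- targets - abs(targets) * tol
cub <- targets + abs(targets) * tol

within.tol <- achieved >= clb & achieved <= cub
data.frame(targets, achieved, clb, cub, within.tol)
```

Here the negative-AGI target is counted as hit only because of its broad 25% tolerance, while the wage target misses its 1% band.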

It was more difficult to solve than yesterday's optimization that only involved 82 targets. It required 63 iterations and took 30 seconds to solve.

I have put three files in the synpuf directory: synthpop3_puf.csv, synthpop3_syn.csv, and synthpop3_rwt.csv. The first two are simply minor variants on PUF and synthpop3 - prepared so that they can be run through tax calculator, but already including two outputs from that process, c00100 and taxbc. The third file is the reweighted version of synthpop3_syn.csv - the only difference is that the wt variable is the NEW post-optimization weight and it also has a variable, wt_rawsyn, which is the synthesized weight, not further adjusted.

@feenberg, it would be great if you can run these through your tax reforms. Please note that if you need S006, then for the reweighted synthesized file you will need to create S006=wt and, depending on how you scale it, you may need to multiply by 100 because I have already divided the weights by 100. Also, you may need to drop c00100 and taxbc to be safe.
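A minimal sketch of that preparation step, using a small made-up data frame in place of synthpop3_rwt.csv. It assumes that wt is the weight already divided by 100; whether the multiplication is needed depends on the scaling the downstream routine expects.

```r
# Toy stand-in for the reweighted file (values are made up)
rwt <- data.frame(
  wt        = c(12.5, 3.0, 40.25),   # post-optimization weight
  wt_rawsyn = c(10.0, 4.0, 35.00),   # synthesized weight, not further adjusted
  c00100    = c(30e3, 0, 250e3),
  taxbc     = c(2500, 0, 60e3),
  e00200    = c(28e3, 0, 200e3)
)

# Assumption: wt was divided by 100, so multiply back to get S006
rwt$S006 <- rwt$wt * 100

# Drop the two Tax-Calculator outputs so they are recomputed downstream
rwt <- rwt[, !(names(rwt) %in% c("c00100", "taxbc"))]
names(rwt)
```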

Here are selected results. There are 9 tables below in 3 sets of triplets. In each table, the columns show the values in puf, synthesized, and reweighted synthesized, followed by differences from puf and then % differences from puf. The triplets:

You can see that we still have difficulty with the negative AGI range, although there is essentially no tax liability there (under current law). We also are having trouble with the $200-500k range.

After the tables I put a histogram of the "x" values -- the ratio of new weight to synthesized weight.

I am about tapped out until I return on Jan 12. There are plenty of things to look for, including:

I will push the repo to github shortly.

image

image

image

image

donboyd5 commented 5 years ago

I have pushed it to github.

donboyd5 commented 5 years ago

@MaxGhenis, one approach to synthesizing weights from whole cloth, but sticking with our fast current technology (ipopt), might be to:

Any thoughts? If we can devise such a differentiable objective function, we could pile on as many constraints as we want, I think.