donboyd5 / synpuf

Synthetic PUF

Measuring quality of the weighted file for its intended uses #8

Open donboyd5 opened 5 years ago

donboyd5 commented 5 years ago

I think two of our most important file-quality goals should be to have a file that is good for:

  1. Measuring the revenue implications of tax policy changes
  2. Measuring the distributional consequences of tax policy changes

Are these both crucial file-quality goals?

Are there other crucial file-quality goals?

How should we operationalize measuring file quality with these goals in mind?

donboyd5 commented 5 years ago

Broad approach to short-term goals: getting base, syn, and synadj

We established short-term goals (from now to ~Feb 2019) in a 12/4/2018 phone call that gave @donboyd5 the lead on measuring weighted file quality, with input from the full team.

I anticipate the general approach will be:

  1. Read the synthesized file constructed by @MaxGhenis (call it "syn") and the base file it attempts to resemble (call it "base") from Dropbox. In practice, base may be the holdout file and syn may be the synthesized version of that holdout file.
  2. After some preliminary checks, construct a 3rd file (call it "synadj") that has adjusted weights, intended to ensure that synadj is extremely close to base along many aggregated dimensions (hundreds or maybe thousands). (Examples: # of returns, $ AGI, $ wages, etc., by filing status and AGI range.) A minimal sketch of one possible reweighting approach appears after this list.
  3. Compare syn and synadj to base, and construct summary measures of how well the files do in these comparisons. See the next post for a description of the comparisons. In a later post I will give initial thoughts on measures. Please comment on any and all.
  4. Post or otherwise pass the results of this analysis back to @MaxGhenis and team. To the extent the information identifies weaknesses in syn that are due to problematic variables or relationships among variables, it may help @MaxGhenis improve syn.
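
For step 2, a minimal sketch of one way the reweighting could work, assuming base and syn are pandas DataFrames carrying Tax-Calculator-style columns s006 (weight), MARS (filing status), and c00100 (AGI). The cell-ratio approach shown here matches only weighted return counts per cell and is illustrative, not a settled method:

```python
import numpy as np
import pandas as pd

def adjust_weights(base, syn, agi_breaks):
    """Rescale syn's weights so that, within each filing-status-by-AGI
    cell, synadj's weighted return count matches base's."""
    def add_cell(df):
        agi_bin = pd.cut(df["c00100"], bins=agi_breaks).astype(str)
        return df.assign(cell=df["MARS"].astype(str) + "|" + agi_bin)

    base_c, syn_c = add_cell(base), add_cell(syn)
    ratio = (base_c.groupby("cell")["s006"].sum()
             / syn_c.groupby("cell")["s006"].sum()).rename("ratio")

    synadj = syn_c.join(ratio, on="cell")
    synadj["s006"] *= synadj["ratio"].fillna(1.0)  # leave unmatched cells alone
    return synadj.drop(columns=["cell", "ratio"])

# Hypothetical AGI breakpoints, in dollars
agi_breaks = [-np.inf, 0, 25e3, 50e3, 100e3, 200e3, 500e3, 1e6, np.inf]
# synadj = adjust_weights(base, syn, agi_breaks)
```

Hitting hundreds or thousands of targets simultaneously (counts plus dollar totals) would call for something stronger, such as raking or a constrained-optimization reweighting, but the cell structure would be similar.
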
donboyd5 commented 5 years ago

General approach to comparisons of base, syn, and synadj

I anticipate 4 kinds of comparisons that increase in complexity and in their ability to tell us how well syn and synadj are performing:

  1. Analysis of summarized weighted data. Here I will produce a broad set of summary statistics that tell us how well syn and synadj conform to base, in summarized categories. The goal will be to pick variables and categories that we believe are important in determining total tax liability and changes in tax liability, for categories of interest. For example, by AGI range and major filing status, I'll compute # of weighted records (# of returns), total weighted AGI, total weighted wages, total weighted capital gains, total weighted interest income, # of returns with positive wages, # with positive business income, total weighted positive business income, # with negative business income, total weighted negative business income, etc. These are just examples, but a good indication of the kind of information that would be useful. A file could have very good record-level information on these variables and still be of low quality for tax policy analysis if the weights are inappropriate. (A sketch of this kind of summary table appears right after this list.)
  2. Analysis of marginal tax rates. I will work closely with @feenberg to compute marginal tax rates for individual records and then construct meaningful summaries (e.g., p25, p50, and p75 of MTR by AGI range and major filing status). (A sketch appears after the "why" discussion below.)
  3. Comparison of base-year tax law. Stack base, syn, and synadj on top of each other, run them through the Tax-Calculator CLI, and then compare weighted results (e.g., tax liability) of all 3 files, across AGI range and major filing status. (A good question is whether @MaxGhenis might want to do the same with the unweighted files, comparing syn and base. It might provide useful information on how well relationships among variables have held up in syn, compared to base.)
  4. Comparison of 3 tax reforms to base-year tax law. Repeat step 3 above for each of 3 tax reforms, and focus on the deltas (how much tax liability changes) and on winners and losers, by AGI range and filing status. @andersonfrailey has developed JSON files for three such reforms that affect different kinds of taxpayers. We would use those, or amended versions of those, for the analysis. (A sketch covering steps 3 and 4 appears at the end of this comment.)
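
To make item 1 concrete, here is a sketch of the kind of weighted summary I have in mind, again assuming Tax-Calculator-style column names (s006 weight, c00100 AGI, e00200 wages, MARS filing status); the variables and AGI breaks are illustrative only:

```python
import numpy as np
import pandas as pd

def weighted_summary(df, label, agi_breaks):
    """Weighted return counts and dollar totals by AGI range and filing status."""
    df = df.assign(agi_range=pd.cut(df["c00100"], bins=agi_breaks))
    g = df.groupby(["agi_range", "MARS"], observed=True)
    out = pd.DataFrame({
        "returns": g["s006"].sum(),
        "agi": g.apply(lambda x: (x["s006"] * x["c00100"]).sum()),
        "wages": g.apply(lambda x: (x["s006"] * x["e00200"]).sum()),
        "returns_pos_wages": g.apply(lambda x: x.loc[x["e00200"] > 0, "s006"].sum()),
    })
    return out.assign(file=label)

# Stack summaries for all three files, then look at % differences vs. base
# agi_breaks = [-np.inf, 0, 50e3, 100e3, 500e3, np.inf]
# tab = pd.concat([weighted_summary(f, n, agi_breaks)
#                  for f, n in [(base, "base"), (syn, "syn"), (synadj, "synadj")]])
```

Percent differences of syn and synadj from base, cell by cell, could then serve as the summary quality measures for this first test.
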

Why do all 4 comparisons?

  1. I think the first, examination of the raw material, is just good practice. It lets us know where our constructed files are out of line with true (base) data, and it will give an indication of the kinds of analyses for which a constructed file is likely to give us trouble.
  2. @feenberg has argued forcefully that marginal tax rates will give us a very good idea of how well a constructed file will do in "scoring" tax reforms, particularly if examined by income range and other categories. The first test won't do this.
  3. That leads to the third test: how well do the files do in estimating taxes under some base-year law? Are they close or not? If we have the MTRs right, then we probably should be good. But MTRs don't tell us whether total taxes are about right, and we want that, too. This test will help with that.
  4. Finally, even MTRs and current-law analyses may not tell us how well the file will do in estimating the impact of a tax reform, which is one of our major purposes. For example, a constructed file could be right for the wrong reasons. It could have MTRs that are close to the base file MTRs by some measures because taxable incomes were about right, but it could have the wrong composition of taxable incomes. It could have, for example, too much wage income and too little passthrough income. Along comes a tax reform that reduces effective rates on passthrough income, and the constructed file will underestimate its impact. By analyzing selected tax reforms we'll get a better sense of how well a file will do at the intended tasks. But it is only practical to examine a few tax reforms. (That's why the test 1 analysis of many variables is helpful: if the summary data are right across many dimensions, the odds of a tax reform analysis being right are better, too.)
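
For the MTR comparisons in test 2, Tax-Calculator's Python API can compute record-level marginal rates directly. A sketch, assuming a recent Tax-Calculator and glossing over details such as weights files and start years (the file name and year are hypothetical):

```python
import numpy as np
import pandas as pd
from taxcalc import Policy, Records, Calculator

# Build a calculator for one of the three files (base, syn, or synadj)
recs = Records(data="syn.csv")  # hypothetical file name
calc = Calculator(policy=Policy(), records=recs)
calc.advance_to_year(2018)
calc.calc_all()

# mtr() returns three arrays (payroll-tax, income-tax, and combined MTRs),
# here computed with respect to taxpayer wages (e00200p)
_, iitax_mtr, _ = calc.mtr("e00200p")

df = pd.DataFrame({
    "mtr": iitax_mtr,
    "agi": calc.array("c00100"),
    "mars": calc.array("MARS"),
    "wt": calc.array("s006"),
})

# p25/p50/p75 of the income-tax MTR by AGI range and filing status
# (unweighted quantiles for simplicity; weighted quantiles would be better)
agi_breaks = [-np.inf, 0, 50e3, 100e3, 500e3, np.inf]
summary = (df.assign(agi_range=pd.cut(df["agi"], bins=agi_breaks))
             .groupby(["agi_range", "mars"], observed=True)["mtr"]
             .quantile([0.25, 0.50, 0.75]))
print(summary)
```
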

One possible result of test 4 is that we may learn that a constructed file is good for analyzing some kinds of reforms and not others. That would be valuable information for users.
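
For tests 3 and 4, either the tc command-line tool or the Python API can produce weighted liability totals under current law and under a reform JSON file. A sketch using the Python API, assuming a recent Tax-Calculator in which Policy.read_json_reform and Calculator.weighted_total are available (file names are hypothetical):

```python
from taxcalc import Policy, Records, Calculator

def weighted_iitax(data_path, year=2018, reform_path=None):
    """Total weighted income-tax liability for one file, optionally under a reform."""
    pol = Policy()
    if reform_path is not None:
        pol.implement_reform(Policy.read_json_reform(reform_path))
    calc = Calculator(policy=pol, records=Records(data=data_path))
    calc.advance_to_year(year)
    calc.calc_all()
    return calc.weighted_total("iitax")

# Reform delta (reform liability minus current-law liability) for each file
for name in ["base.csv", "syn.csv", "synadj.csv"]:  # hypothetical file names
    delta = weighted_iitax(name, reform_path="reform1.json") - weighted_iitax(name)
    print(name, delta)
```

The same loop, broken out by AGI range and filing status rather than totals, would give the winners-and-losers comparisons described in test 4.
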

donboyd5 commented 5 years ago

One possible summary measure for Comparison of base year tax law

One summary measure for item 3 in the list above (comparison of base-year tax law) would be to compute the cumulative distribution of weighted total tax (e06500) vs. AGI for our three files (the PUF, the synthetic PUF with synthesized weights, and the synthetic PUF with adjusted weights), where tax and AGI are obtained by running the data files through Tax-Calculator.

An exploratory look would put all 3 distributions on a graph (similar to the graph in issue #16, but with 3 lines). The comparison could be formalized with two goodness-of-fit statistics (one comparing the fit of syn to base, and one comparing the fit of synadj to base). I don't think a Kolmogorov-Smirnov test is appropriate because it is univariate, whereas this involves 2 variables (total income tax and AGI), but I am sure we can choose a suitable test.
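
A sketch of the weighted-CDF comparison with a simple KS-style maximum-gap summary; the gap statistic is a placeholder for whatever formal test we settle on:

```python
import numpy as np

def weighted_tax_cdf(agi, tax, wt, grid):
    """Cumulative share of weighted total tax from returns with AGI <= each grid value."""
    order = np.argsort(agi)
    agi_sorted = agi[order]
    cum_share = np.cumsum((tax * wt)[order])
    cum_share = cum_share / cum_share[-1]
    # Step-function value of the cumulative share at each grid point
    idx = np.searchsorted(agi_sorted, grid, side="right") - 1
    return np.where(idx >= 0, cum_share[np.clip(idx, 0, None)], 0.0)

def max_gap(a, b, grid):
    """KS-style maximum vertical distance between two weighted tax CDFs."""
    return np.abs(weighted_tax_cdf(*a, grid) - weighted_tax_cdf(*b, grid)).max()

# Hypothetical usage: each file is a tuple of (agi, tax, weight) numpy arrays
# grid = np.linspace(-1e5, 2e6, 1001)
# gap_syn = max_gap((agi_base, tax_base, wt_base), (agi_syn, tax_syn, wt_syn), grid)
# gap_synadj = max_gap((agi_base, tax_base, wt_base), (agi_adj, tax_adj, wt_adj), grid)
```

This treats AGI as the ordering variable and the tax share as the distribution, which sidesteps the bivariate-test problem at the cost of some information.
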

It might also make sense to do the same comparisons on some of the underlying variables that will have a strong effect on tax calculations, such as major components of income and deductions.

donboyd5 commented 5 years ago

In #9 @feenberg wrote:

I have a program that scores 35 or so plausible tax reforms with the PUF and another file. If the alternate file is just the PUF rounded to 2 digits, the scores are very close. I'd like to try the synth file again, but the first draft gave scores that were not good. I'll try again with the next version.

@donboyd5 responded that he looks forward to seeing it.

donboyd5 commented 5 years ago

One thought: we are not synthesizing the 4 aggregated records (#15). When you compare the actual PUF to the synthetic PUF, it would be important to drop those 4 records from the actual PUF.
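
A minimal way to do that, assuming the PUF is loaded as a pandas DataFrame and following the SOI convention that the four aggregated records carry RECID values 999996 through 999999 (worth verifying against the PUF documentation for the vintage in use):

```python
import pandas as pd

puf = pd.read_csv("puf.csv")  # hypothetical path to the actual PUF

# The SOI PUF's four aggregated records conventionally carry these RECIDs;
# verify against the documentation for the vintage in use.
AGG_RECIDS = {999996, 999997, 999998, 999999}
puf_no_agg = puf[~puf["RECID"].isin(AGG_RECIDS)].copy()
```
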

cc: @feenberg