PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

FYI: Comparisons of IRS summary statistics for 2017 to puf.csv with: (1) default stages 1, 2, and 3, (2) custom stage 1 growfactors and default stage 2, and (3) version #2 reweighted #389

Open donboyd5 opened 3 years ago

donboyd5 commented 3 years ago

You will find the 3 summary report files here. They should be visible to all.

All comparisons are for filers only, as defined in PSLmodels/Tax-Calculator#2501 prior to today (2020-11-06). @MattHJensen suggested an improvement to the filer definition, which I will implement relatively soon, but I don't think the results will change much.

Each summary report file has 3 sections:

  1. Summary section that compares file totals (for tax filers)
  2. Detailed section that compares results by income range
  3. Documentation section that shows how I mapped puf variables to IRS concepts

The files are:

  1. irs_pufdefault_comparison.txt: this compares the official puf.csv from 2020-08-20, using default (Tax-Calculator built-in) methods for stage 1 growfactors, stage 2 weights, and stage 3 interest income adjustments to grow to 2017.

  2. irs_pufregrown_comparison.txt: official puf.csv preprocessed with custom growfactors, using default stage 2 weights, and NOT using stage 3 interest income adjustments.

  3. irs_pufregrown_reweighted_comparison.txt: official puf.csv preprocessed with custom growfactors, starting from default stage 2 weights and NOT using stage 3 interest income adjustments, THEN reweighted in an effort to come close to many IRS totals. In general, I tried to target all 34 variables summarized in the 3 files, but I made a few exceptions for the 4 lowest income ranges, where some of the puf values were very far from IRS values and I was concerned that trying to hit the IRS values would distort other variables. In total there were ~32 variables x 18 income ranges, or ~576 targets (give or take a few). FWIW, the solution took 4.5 seconds. (A rough sketch of this kind of bounded reweighting problem appears just below this list.)
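For illustration only, and not the code actually used for this reweighting, a bounded least-squares reweighting problem of this general shape might look like the sketch below. All names (X, w0, targets) and the synthetic data are mine; the 50x cap mirrors the ratio cap mentioned later in this thread.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
n_records, n_targets = 1_000, 20                   # stand-ins for the real problem size
X = rng.lognormal(size=(n_records, n_targets))     # per-record values of targeted variables
w0 = rng.uniform(50, 500, size=n_records)          # original weights
targets = X.T @ (w0 * rng.uniform(0.8, 1.2, size=n_records))  # synthetic "IRS" totals

# Solve min ||X.T w - targets|| subject to 0 <= w <= 50 * w0
res = lsq_linear(X.T, targets, bounds=(np.zeros(n_records), 50 * w0))
w_new = res.x

print("max new/old weight ratio:", (w_new / w0).max())
print("records driven to zero weight:", int((w_new == 0).sum()))
```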

A few comments and observations:

My next steps will be to try to apportion national weights on this file to a selected set of states.

Then, as I learn what may be wrong with this national file, I will go back and try to improve it; in the first iteration of that I will fix up the filer-determination code as pointed out by @MattHJensen in PSLmodels/Tax-Calculator#2501.

Many thanks for any criticism you can provide.

MattHJensen commented 3 years ago

@donboyd5, thanks very much for these summaries. I am looking through them now. This certainly seems promising.

If you could share the regrown-reweighted file, I'd appreciate that. I'd like to poke around with what's happening to non-targeted variables and how much the weights are moving.

donboyd5 commented 3 years ago

Thanks much, @MattHJensen, for looking. Please see the email I sent with a link to the folder.

donboyd5 commented 3 years ago

The weights change quite a bit. I capped the ratio of new weights to old weights at 50. Here is a quick distribution:

[image: histogram of the ratio of new weights to old weights]
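A distribution like the one pictured can be summarized in a few lines of pandas. This is a hypothetical sketch: the filename puf_reweighted.csv is a stand-in, and it assumes the file carries the original weight s006 alongside the reweighted weight s006_rwt (the latter name appears in a later comment).

```python
import pandas as pd

# Hypothetical file and column names; s006_rwt is referenced later in this thread.
filers = pd.read_csv("puf_reweighted.csv")
ratio = filers["s006_rwt"] / filers["s006"]

print(ratio.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]))
print("ratios at the 50x cap:", int((ratio >= 50).sum()))
print("zero-weight records:", int((filers["s006_rwt"] == 0).sum()))
```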

donboyd5 commented 3 years ago

@MattHJensen, I put 3 more files in the folder available to people with access to puf.csv:

MattHJensen commented 3 years ago

> I put 3 more files in the folder available to people with access to puf.csv:

I saw the extra files and appreciate your having included them. Thanks for the overview here, too.

> The weights change quite a bit. I capped the ratio of new weights to old weights at 50. Here is quick distribution:

This replicated on my machine, which is a nice check I've got the right files.

It looks like 4,216 records are unused in the reweighted file (`(filers.s006_rwt == 0).sum()`), contributing to the ratios of 0 at the left end of the distribution.

In the past, the TaxData project has sought to minimize changes in weights (as you know) and has selected targets parsimoniously to avoid distorting the relationships among variables from their manifestations in the base data. So it is a departure to target more thoroughly and allow weights to move more freely. But there is a great deal to be said for hitting a broader set of SOI targets, and it's not clear that relationships in the base data year are as meaningful as they used to be given elapsed time and policy changes. All of that is just to say, as the maintainer of a project that requires tax data extrapolations as inputs, I'm really enjoying looking at these comparisons and thinking about the tradeoffs.

@chusloj and @andersonfrailey have been thinking about how to move TaxData's stage 3 interest adjustment into stage 2, so they may be interested to see how an alternative approach to setting up the problem could potentially make it easier.

donboyd5 commented 3 years ago

Thanks for looking, @MattHJensen. I agree:

1) It is attractive to keep the changes in weights minimal.
2) That becomes less attractive the further we get from the base year, on the assumption that changes in the economy, changes in tax law, and changes in behavior make 2011 relationships less relevant the further we are from 2011.
3) It also becomes less attractive if keeping weight changes minimal leaves important variables, those that drive the revenue and distributional impacts of tax policies, far off.

The question is, how much is too much? With sufficient time, we could work both ends toward the middle. Define a set of variables we care about, and define "correct" values for those variables (e.g., the IRS published totals). Then:

1) From one end, start with a minimal set of variables to target that we suspect are the most important (e.g., AGI and the number of returns by income range, crossed by at least 2 marital statuses), and see how far off the important untargeted variables are. Supplement this by running Tax-Calculator to get (a) tax liability under current law, and see how far it (or several related variables) is from what we expect to be true (in total, by AGI range, and by other cuts), and (b) the change in tax liability under an important policy alternative, and see how far off this is from what we expect from other sources (JCT? TPC? taxdata default?).

Also evaluate how much the weights had to change to hit or approximate these targets. It is possible to put restrictions on how much they change. When doing that, two things can happen: (i) we might still hit the targets by making many smaller changes to other weights, which might be desirable, or (ii) we might fall further from some targets, and we'd have to evaluate whether that's an acceptable compromise. We might be glad to be 5% off for an unimportant target if it means we don't have to jerk weights around so much, but we might not want to be 5% off for AGI or the number of returns.

It is also possible, in theory, to prioritize targets - e.g., put a weight of 1 on hitting AGI targets in income ranges 9 and 10, a weight of 0.8 on ranges 1 and 2, and a weight of only 0.2 on taxable interest in any income range (a rough sketch of one way to encode such priorities appears below). That seems like an important extension, and I have done variants of it. Of course, you then need to quantify judgments about what's most important and what's not.

Then add the next most important variable and repeat.

2) From the other end, start with a maximal set of targeted variables and do the same evaluation. Then drop the least important variable and repeat.

After a few iterations, we'd probably get a good sense of where the happy medium is. We might even be able to create some rules of thumb that help us make our judgments transparent and repeatable.
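To make the priority idea above concrete, here is a rough sketch, my illustration rather than anything implemented in this thread, of one way to encode target priorities: express each target as a percentage deviation and scale it by a priority weight before solving the bounded least-squares problem. All names and data are synthetic.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(1)
n_records, n_targets = 1_000, 30
A = rng.lognormal(size=(n_targets, n_records))        # rows = targets, columns = records
w0 = rng.uniform(50, 500, size=n_records)             # starting weights
b = (A @ w0) * rng.uniform(0.9, 1.1, size=n_targets)  # synthetic target totals

# Hypothetical priorities: 1.0 for must-hit targets (AGI, return counts),
# 0.8 for moderately important ones, 0.2 for low-priority items.
priority = rng.choice([1.0, 0.8, 0.2], size=n_targets)

# Scale each row so targets enter the objective as priority-weighted
# percentage deviations rather than raw dollar amounts.
scale = priority / b
res = lsq_linear(A * scale[:, None], b * scale,
                 bounds=(np.zeros(n_records), 50 * w0))
w_new = res.x

pct_miss = 100 * (A @ w_new - b) / b
print("worst miss among priority-1 targets (%):",
      np.abs(pct_miss[priority == 1.0]).max())
```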

My problem is that right now I'm racing the clock and can't iterate much. Many of the variables I targeted seem essential for either evaluating tax policies or for apportioning weights across states. I do hope to do some analyses comparing tax calculations on the default puf and a regrown reweighted puf, and also on at least one policy variant, although it will depend on how fast I am at other things. And if you learn that this has just wrenched the data around too much and is leading to implausible results in some areas in comparison to the default puf, that would be really valuable to know.
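A sketch of the kind of liability comparison mentioned above might look like the following. It uses only standard Tax-Calculator calls (Records, Policy, Calculator); the reweighted filename is a hypothetical stand-in, and a real run on that file would also require supplying the custom weights rather than relying on the default puf_weights table.

```python
from taxcalc import Calculator, Policy, Records

def iitax_total_2017(puf_path):
    """Weighted 2017 current-law individual income tax, in billions."""
    recs = Records(data=puf_path)                 # default growfactors/weights/ratios
    calc = Calculator(policy=Policy(), records=recs)
    calc.advance_to_year(2017)
    calc.calc_all()
    return calc.weighted_total('iitax') / 1e9

# "puf_regrown_reweighted.csv" is a stand-in name for the alternative file.
for path in ["puf.csv", "puf_regrown_reweighted.csv"]:
    print(path, round(iitax_total_2017(path), 1), "($B)")
```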

donboyd5 commented 3 years ago

I should add that some of the iteration could result in us learning that some growfactors could be better. In my mind, the 2017 IRS national data gives us valuable information on how average values changed between 2011 and 2017 that could be used to modify growfactors. It is possible to be far more rigorous than I have been and that would be valuable. I just modified a few growfactors that seemed very important and where growth in IRS data was significantly different than what existing growfactors implied.
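As one illustration of the kind of adjustment being described (my sketch, with a made-up IRS growth number and an assumed column name), a growfactor column could be rescaled so that its cumulative 2012-2017 growth matches the growth observed in IRS totals:

```python
import pandas as pd

gf = pd.read_csv("growfactors.csv").set_index("YEAR")   # taxdata growfactors file

# Cumulative growth implied by the interest-income factor over 2012-2017
# (column name AINTS assumed from the taxdata growfactors file).
implied = gf.loc[2012:2017, "AINTS"].prod()

irs_growth = 0.95                                 # hypothetical: 2017 IRS total / 2011 IRS total
per_year = (irs_growth / implied) ** (1 / 6)      # spread the adjustment evenly over 6 years
gf.loc[2012:2017, "AINTS"] *= per_year
gf.reset_index().to_csv("growfactors_custom.csv", index=False)
```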

MattHJensen commented 3 years ago

> After a few iterations, we'd probably get a good sense of where the happy medium is. We might even be able to create some rules of thumb that help us make our judgments transparent and repeatable.

Yes, this makes sense. As does, for a time-constrained new project, picking a sensible starting point and then looking for problems. I'm having fun with the data so far and will report back here with anything that seems of interest.

> I should add that some of the iteration could result in us learning that some growfactors could be better. In my mind, the 2017 IRS national data gives us valuable information on how average values changed between 2011 and 2017 that could be used to modify growfactors.

This makes sense as well, and I suspect TaxData maintainers agree too. Even before this, it would probably be good to refine TaxData's filer identification strategy.

donboyd5 commented 3 years ago

@MattHJensen, please see the updated filers function in PSLmodels/Tax-Calculator#2501. I plan to use it in the next iteration of reweighting, early next week, and welcome comments.

jdebacker commented 3 years ago

@donboyd5 @MattHJensen I am unsure of the status of this issue, but should it be moved over to the TaxData repo?

donboyd5 commented 3 years ago

@jdebacker @MattHJensen Yes, I think that makes sense. After taxdata is updated again soon, I hope to update some of this analysis.