Including weighted values and high-end income in evaluations of synthetic data file quality

This comment is an edited version of an email from @donboyd5 from 2018-11-17.

Using initial 5% runs from @MaxGhenis (because those are the data I have available, and it looks like the results are not all that different from the 50% run), the CDF for wages looked like this:

I was surprised at how great it looks in the upper income ranges. I did not initially appreciate two important things: 1) the x-axis is on a log10 scale, so it really telescopes the low end, and 2) the y-axis is the weighted number of observations, not the weighted money value.

Because I care so much about the money values (are we getting the right amounts, in the right income ranges), I wanted to look into money values more deeply.

First, just to provide some comfort that I'm doing it right, I reproduced in R the CDF for wages that was in the Jupyter notebook, approximately:

So far so good. About 37% or so of the weighted observations have wages <= $10k. But where is the money? If we create the same graph but base the cumulative percentage on weighted money rather than weighted number of observations - that is, we use:

 cum.pct=cumsum(s006*value) / sum(s006*value)   on the observations ordered by wages,

         rather than

 cum.pct=cumsum(s006) / sum(s006)   on the observations ordered by wages

we get the graph below, and we start to see some less attractive things in the upper end of the income distribution -- the total dollar amount of wages is further from "truth" than we might like, which means that when we analyze tax policy reforms that affect the upper end of the income distribution, our revenue effects (and perhaps distributional effects) could be quite far off. So we'd like to look at this more deeply.
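To make the difference between the two cumulative percentages concrete, here is a minimal Python sketch (not the R code used for the graphs). The function name `cumulative_shares` and the toy records are made up for illustration; each record is a (wage, weight) pair, with the weight playing the role of s006.

```python
from itertools import accumulate


def cumulative_shares(records):
    """Return (wage, cumulative count share, cumulative money share) rows.

    records: iterable of (wage, weight) pairs; weight plays the role of s006.
    Assumes total weight and total weighted wages are both positive.
    """
    ordered = sorted(records)  # order the observations by wages
    weights = [weight for _, weight in ordered]
    money = [wage * weight for wage, weight in ordered]
    cum_n = list(accumulate(weights))      # cumsum(s006)
    cum_money = list(accumulate(money))    # cumsum(s006 * value)
    return [
        (wage, n / cum_n[-1], m / cum_money[-1])
        for (wage, _), n, m in zip(ordered, cum_n, cum_money)
    ]


# Made-up toy data, not values from any of the actual files.
toy = [(5_000, 40.0), (30_000, 40.0), (100_000, 15.0), (1_000_000, 5.0)]
for wage, pct_n, pct_money in cumulative_shares(toy):
    print(f"{wage:>9,}  count share {pct_n:6.1%}  money share {pct_money:6.1%}")
```

In this toy example, 80% of the weighted observations have wages of $30k or less, but they hold well under 20% of the weighted wages, which is the kind of gap the count-based CDF hides.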

Looking at the graph, we can see that we don't even reach 10% of the total money for wages until we get to about 10 to the power 4.5 (which is ~$30k of wages). So let's look at the other 90%. Here is a graph that shows the cumulative distribution of weighted wages for wages above $30k (wages above $30k constitute 100% of all wages here, because I simply drop observations with wages <= $30k); sorry the x-axis has so little labeling, but let's focus on the y-axis. Now we see that somewhere just before we get to $1 million of total wages, synthpop is maybe 2% below test. That seems like a lot of money to me. And it looks like sequential random forests might actually be performing better than synthpop in the very upper part of the wage distribution.
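The "drop everything at or below $30k and renormalize" step can be sketched as follows. This is an illustrative Python version under my own assumptions: `money_cdf_above`, `THRESHOLD`, and both toy files are hypothetical names and made-up numbers, not the real data.

```python
from itertools import accumulate

THRESHOLD = 30_000  # rounded from 10 ** 4.5, which is about 31,623


def money_cdf_above(records, threshold=THRESHOLD):
    """Cumulative share of weighted wages among records with wage > threshold.

    records: iterable of (wage, weight) pairs.
    Assumes at least one record lies above the threshold.
    """
    kept = sorted((wage, weight) for wage, weight in records if wage > threshold)
    cum = list(accumulate(wage * weight for wage, weight in kept))
    # Dividing by the last cumulative value makes wages above the
    # threshold sum to 100%.
    return [(wage, c / cum[-1]) for (wage, _), c in zip(kept, cum)]


# Made-up toy files: the "synthetic" one puts too much weight at the top.
toy_test = [(20_000, 50.0), (60_000, 30.0), (250_000, 8.0), (2_000_000, 1.0)]
toy_synth = [(20_000, 50.0), (60_000, 30.0), (250_000, 8.0), (2_000_000, 1.5)]

for (wage, share_test), (_, share_synth) in zip(
    money_cdf_above(toy_test), money_cdf_above(toy_synth)
):
    print(f"{wage:>9,}  test {share_test:6.1%}  synth {share_synth:6.1%}")
```

Because the curve is a share of the file's own total, too much money at the very top pulls the whole synthetic curve below test at lower wage levels, even where the two files agree exactly.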

Sometimes - not often - a table is worth a thousand pictures.

The first table below compares the unweighted number of observations by wage range, across the three data sets. The first three numeric columns are the actual values in the respective files. The two columns to the right of those are synthpop minus test, and sequential random forests minus test. The final two columns are the percentage differences from test.

Not sure I know what to make of it. But synthpop seems to be a lot closer than sequential random forests. We could compute a formal measure, of course.
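For anyone who wants to rebuild this kind of table, here is a rough Python sketch of the layout described above (value in each file, difference from test, percentage difference from test). The cut points in `CUTS`, the function names, and the toy records are all assumptions for illustration; the actual tables may use different wage ranges.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the wage ranges; the last range is open-ended.
CUTS = [0, 25_000, 200_000, 500_000]


def counts_by_range(records, weighted=False):
    """Count observations per wage range, or sum their weights if weighted."""
    totals = [0.0] * len(CUTS)
    for wage, weight in records:
        # Index of the last cut point <= wage (negative wages go in range 0).
        i = max(bisect_right(CUTS, wage) - 1, 0)
        totals[i] += weight if weighted else 1
    return totals


def compare(test_totals, synth_totals):
    """Rows of (test, synth, synth - test, percent difference from test)."""
    rows = []
    for t, s in zip(test_totals, synth_totals):
        pct = 100 * (s - t) / t if t else float("nan")  # undefined if test is 0
        rows.append((t, s, s - t, pct))
    return rows


# Made-up (wage, weight) records, not values from the actual files.
toy_test = [(10_000, 2.0), (40_000, 3.0), (40_000, 1.0), (600_000, 0.5)]
toy_synth = [(10_000, 2.0), (40_000, 3.0), (300_000, 1.0),
             (600_000, 0.5), (700_000, 0.5)]

for row in compare(counts_by_range(toy_test), counts_by_range(toy_synth)):
    print(row)
```

Passing `weighted=True` to `counts_by_range` gives the weighted number of observations instead, with the same column meanings.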

The next table is the weighted number of observations; the columns have the same meanings.

Again, to the naked eye, without a formal measure, synthpop looks a lot better. Those double-digit percentage errors are pretty disconcerting. We can see that we are going to have some big errors in weighted values, given how far off the numbers of weighted observations are.

Next, we look at average wages in each range:

Bad, but not as bad as the number of weighted observations.

Finally - and to me, extremely important - here are weighted wages, in billions of dollars. This is where the rubber meets the road:

Wow. synthpop is about 4% too high on the bottom line (that's a lot for tax policy analysis, in my opinion), and sequential random forests is about 6% too low. Even more startling, fully half of synthpop's error is in the top wage range -- wages of $500k and above, where it is 50% too high. By contrast, more than all of sequential random forests' error is in the $25k to $200k wage ranges; that is, the shortfall in those ranges exceeds the net shortfall, and errors in the other ranges partly offset it. If we were examining a policy proposal to put a surtax on people with wages above $500k and use it to pay for a middle class tax cut, people would be dancing in the streets if we used the synthpop dataset, and hanging their heads in sorrow if we used sequential random forests.
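The "half of the error" and "more than all of the error" statements are each range's share of the net difference in weighted wages. A small Python sketch of that arithmetic follows; `error_shares` is a hypothetical helper, and the dollar figures are made up to show how one range can account for more than 100% of the net error. They are not the numbers from the table.

```python
def error_shares(test_totals, synth_totals):
    """Each range's share of the net difference in total weighted wages.

    Inputs are per-range weighted wage totals (for example, in $ billions).
    Undefined when the net difference is zero.
    """
    diffs = [s - t for t, s in zip(test_totals, synth_totals)]
    net = sum(diffs)
    return [d / net for d in diffs]


# Made-up weighted wages by range, in $ billions (illustration only).
ranges = ["< $25k", "$25k to $200k", "$200k to $500k", "$500k+"]
toy_test = [500.0, 4000.0, 900.0, 600.0]
toy_synth = [510.0, 3500.0, 920.0, 650.0]

for label, share in zip(ranges, error_shares(toy_test, toy_synth)):
    print(f"{label:<15} {share:7.1%} of the net error")
```

Here the middle range accounts for about 119% of the net shortfall, and the other ranges show negative shares because their errors run the other way and partly offset it.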

Because so much of our error is caused by getting the weighted numbers of observations within wage ranges wrong, rather than average values wrong, we'll want to pay more attention to the former, although the latter could benefit from improvement, too.
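One way to separate the two sources of error within a wage range is to use the identity weighted wages = weighted number of observations x average wage. The sketch below is my own illustrative Python, with a hypothetical `decompose` function and made-up inputs; it is not a measure taken from the analysis above.

```python
def decompose(n_test, avg_test, n_synth, avg_synth):
    """Split (n_synth * avg_synth - n_test * avg_test) into three pieces.

    count_effect:   error from the weighted number of observations alone
    average_effect: error from the average wage alone
    interaction:    the part that comes from both being off at once
    """
    d_n = n_synth - n_test
    d_avg = avg_synth - avg_test
    return {
        "count_effect": d_n * avg_test,
        "average_effect": n_test * d_avg,
        "interaction": d_n * d_avg,
    }


# Made-up top wage range: 40% too many weighted returns, 5% too high an average.
parts = decompose(n_test=1_000, avg_test=1_000_000,
                  n_synth=1_400, avg_synth=1_050_000)
total_error = 1_400 * 1_050_000 - 1_000 * 1_000_000

for name, value in parts.items():
    print(f"{name:<15} {value:>13,}  ({value / total_error:.1%} of the error)")
```

In this made-up case the count effect is about 85% of the error, which matches the pattern described above: the weighted numbers of observations matter more than the averages.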

My conclusions from all of this are that (a) we need to pay really close attention to weighted money values, (b) we need to pay really close attention to income ranges or cuts on the variables themselves, (c) we need to try to focus on this during the estimation and prediction phases, (d) we need to be prepared to fix things on the back end when they look really bad (e.g., reweighting), (e) we need visual and other evaluation methods that help us diagnose all of this, and (f) ultimately we would like formal measures that help us evaluate this.
