Comparing synpuf2, synpuf4, and puf.full

donboyd5 commented 5 years ago

Let me know if there is a better way to provide snippets of feedback. I figure that until I have a regular set of output, I can open an issue, @MaxGhenis and others can look, and then we can close the issue as we move on to later iterations of a file. The goal here is to provide useful feedback to @MaxGhenis as he develops early runs. I'll have some standardized html output that I can put in the Google Drive folder fairly soon.

The next few tables show correlations between full synthesized files and the original puf (with aggregate records removed). They follow the same format, so I'll explain that with the first table only.

donboyd5 commented 5 years ago

This table shows correlations among variables in the 3 files puf.full, synpuf2, and synpuf4 -- the 3 all-records files we have at present. From discussion elsewhere we know that synpuf4 will have better weighted values than synpuf2, so I'll pay more attention to synpuf4.

First column ("combo") identifies the variables for which correlations are shown.
Next 3 columns show the correlations within each of the 3 files.
Next 2 columns:
diff2 = synpuf2 - puf.full
diff4 = synpuf4 - puf.full
Next 2 columns, for context, are sums in puf.full of absolute values of the weighted variables, in $ billions:
awsumb1 is the sum for the leftmost variable in the "combo" column
awsumb2 is the sum for the rightmost variable

The table in this post and in the next post differ only in how they are sorted. This table is sorted by diff4. It shows only the top 50 values by this sort.
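For readers who want to reproduce a table of this shape, here is a minimal sketch in plain Python. It is not the R code behind the table. The function names and the dict-of-lists input layout are assumptions made for illustration, and sorting by the absolute value of the difference is my reading of "worst absolute difference", not something confirmed in the thread.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Unweighted Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(columns):
    """columns: dict of variable name -> list of values (hypothetical layout).

    Returns a dict mapping each variable pair (a, b) to its correlation.
    """
    names = sorted(columns)
    return {(a, b): pearson(columns[a], columns[b])
            for a, b in combinations(names, 2)}


def compare_correlations(puf, syn, top=50):
    """Build rows of combo, puf correlation, synthetic correlation, and diff.

    diff = synthetic correlation minus puf correlation, as in the table.
    Rows are sorted by |diff|, largest first (an assumption), and truncated.
    """
    base = pairwise_correlations(puf)
    other = pairwise_correlations(syn)
    rows = [{"combo": f"{a}-{b}",
             "puf": base[(a, b)],
             "syn": other[(a, b)],
             "diff": other[(a, b)] - base[(a, b)]}
            for (a, b) in base]
    rows.sort(key=lambda r: abs(r["diff"]), reverse=True)
    return rows[:top]
```

Calling `compare_correlations` once per synthetic file gives the equivalents of diff2 and diff4. Both inputs need the same variable names.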

Here is how to read the first row:

it shows correlation of e00300 and p23250 (interest income, and LT net gains/losses)
their correlation in puf.full is .254; it is .246 in synpuf2, and .552 in synpuf4
the synpuf2 and synpuf4 differences from the puf.full correlation are -.009 and .297 respectively
the .297 is the worst absolute difference in synpuf4 (since diff4 is the sort variable)
interest income was $113 billion in the 2011 puf (excl aggregate records) and LT net gains were $466 billion

We can see that the 4th-worst correlation in synpuf4 is between wages (e00200) and interest income. This is concerning (to me) because wages are so large. I'll come back to this, I think.

donboyd5 commented 5 years ago

Here is the same table, sorted by awsumb1. Except for the first record, diff4 looks pretty good.

MaxGhenis commented 5 years ago

Are correlations here weighted by s006?

donboyd5 commented 5 years ago

No. They are unweighted. So they should be the same as any you may have done. (Pls let me know if you have different results.) The only value-added here is having the weighted sums info in 2 rightmost columns.

Not trying to replicate what you may be doing. I want to focus on the weighted file. But as I was moving in that direction, I naturally looked at some correlations so wanted to make sure you have the info since I have it. I'll post some weighted values info soon.
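For comparison with the unweighted numbers, a weighted correlation could be computed along these lines. This is a sketch of the standard weighted Pearson formula, not code from this project. The function name is hypothetical, and `w` stands in for a weight such as s006.

```python
from math import sqrt


def weighted_corr(x, y, w):
    """Weighted Pearson correlation of x and y with record weights w.

    Each record contributes in proportion to its weight, so a record with
    weight 2 counts the same as two identical records with weight 1.
    """
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / sqrt(vx * vy)
```

With all weights equal, this reduces to the ordinary unweighted correlation, which makes it easy to check against the table.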

donboyd5 commented 5 years ago

If it is easier for you to have a whole bunch of info at once, rather than a bit here and a bit there, pls let me know.

MaxGhenis commented 5 years ago

I haven't done correlations yet, was just checking. Bit-by-bit is good for me. This is useful, and it's interesting that synpuf4's correlations are of uniformly higher magnitude than the original PUF, and that this happened when adding weights to the seeds. Maybe the small number of trees (20) is causing insufficient dispersion. I'll start one with 40 trees.

donboyd5 commented 5 years ago

Some new summary results and where to find them

I have put an html file named "eval_2018-12-13.html" with some new summary results in the Google Drive synpuf directory. As we get further along, there will be new versions with new dates.

It is all still very early, and very rough. I think the section called "How can synpuf2 weight be so far off and yet weighted wages are so close?" is very interesting. It shows that there is value in going beyond the bottom line, and looking at the distribution of weighted values.

I know we are moving beyond synpuf2 but there is a lot to be seen by looking at synpuf4, diff4 (synpuf4 minus puf.full), and pdiff4 (diff4 as % of puf.full) by wage range. If something is not clear, please let me know.
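The by-wage-range comparison can be sketched roughly as follows. This is an illustration only, not the R code that produced the html file. The function names, the bin breaks passed in, and the treatment of records outside the breaks are all assumptions.

```python
from bisect import bisect_right


def weighted_sum_by_range(values, weights, breaks):
    """Sum value * weight within each range [breaks[i], breaks[i + 1]).

    Records below breaks[0] or at/above breaks[-1] are skipped (an assumption).
    """
    totals = [0.0] * (len(breaks) - 1)
    for v, w in zip(values, weights):
        i = bisect_right(breaks, v) - 1
        if 0 <= i < len(totals):
            totals[i] += v * w
    return totals


def diff_table(puf_totals, syn_totals):
    """Per-range diff (synthetic minus puf) and pdiff (diff as % of puf)."""
    rows = []
    for p, s in zip(puf_totals, syn_totals):
        diff = s - p
        # Percent difference is undefined when the puf total is zero.
        pdiff = 100.0 * diff / p if p != 0 else float("nan")
        rows.append({"puf": p, "syn": s, "diff": diff, "pdiff": pdiff})
    return rows
```

Looking at diff and pdiff per range, rather than only at the grand total, is what exposes offsetting errors across ranges.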

I will start putting some routinized results in the html output, after I run the files through Tax-Calculator and get calculated AGI and taxes. Probably tomorrow for that.

Shortly after that, I'll start producing some CDFs of weighted variables by AGI similar to the graph in #16 but with a line for puf and for each synthesis in the analysis. I will work toward some summary measures after that, but I think it is more important to have diagnostic information at this point. Once we have data files that are worth choosing among, we'll need measures that help us do that, but we're not there yet.
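One possible reading of "CDF of a weighted variable by AGI" is the cumulative share of the weighted total, accumulated over records sorted by AGI. The sketch below implements that reading. It is an assumption about the intended graph, not a description of the one in #16, and the function name is hypothetical.

```python
def weighted_cdf(agi, values, weights):
    """Cumulative share of sum(value * weight) as AGI increases.

    Returns a list of (agi, cumulative_share) points in AGI order.
    Assumes the weighted total is nonzero. For variables that can be
    negative, the cumulative share is not guaranteed to be monotone.
    """
    records = sorted(zip(agi, values, weights))
    total = sum(v * w for _, v, w in records)
    points = []
    running = 0.0
    for a, v, w in records:
        running += v * w
        points.append((a, running / total))
    return points
```

Plotting one such curve for puf and one per synthetic file on the same axes would give the kind of side-by-side comparison described above.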

I am uploading the R project to github once I relearn how to do that (it has been a while and I've since reinstalled Windows so have to get my machine set up properly), probably this evening. @andersonfrailey will then start looking at it, and we'll work together to make it better. @andersonfrailey, I had hoped to have the project cleaner than this but I'll shoot you a note explaining what makes sense to look at.

After all of that I'll put in some reweighting routines to hit many targets, but it will be a while before I can get there.

MaxGhenis commented 5 years ago

Sounds good. Including these results in GH is OK since there's no record data, right?

Answering the question in the section title "How can synpuf2 weight be so far off and yet weighted wages are so close?": synpuf2's total return count was 78% too low, but it greatly overestimated high-wage returns, and these happened to cancel out to yield the 0.4% difference in total wages. A remarkable coincidence, but it shows that the file has some problems.

donboyd5 / synpuf

Comparing synpuf2, synpuf4, and puf.full #23