Which file, exactly, should we synthesize, and what is the right order of operations?

We closed issue #4, which was about what to do with calculated variables. We agreed to synthesize "elemental" variables (from which other variables may be calculated) and run the results through Tax-Calculator to get the calculated variables, giving us properly balanced tax returns. We did not resolve whether to use the calculated variables already on the pre-synthesis PUF as right-hand-side X variables in the synthesis and then throw them away, since the final calculated variables should of course be computed from the synthesized elemental variables. We agreed there are probably conceptual pros and cons to this approach, and that we'll be open-minded and empirical about it.
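To make the "synthesize elemental variables, then recalculate" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the variable names (`wages`, `interest`, `total_income`) are made up, `toy_calculate` is only a stand-in for a real Tax-Calculator run, and `toy_synthesize` is a placeholder that just perturbs values rather than an actual synthesis method.

```python
import random

# Hypothetical elemental variables; real PUF variable names differ.
ELEMENTAL = ("wages", "interest")


def toy_calculate(record):
    """Stand-in for Tax-Calculator: derive calculated variables from elemental ones."""
    out = dict(record)
    out["total_income"] = out["wages"] + out["interest"]
    return out


def toy_synthesize(records, rng):
    """Placeholder synthesizer: perturbs each elemental variable and
    drops any calculated variables that were on the input records."""
    return [
        {v: round(rec[v] * rng.uniform(0.9, 1.1), 2) for v in ELEMENTAL}
        for rec in records
    ]


rng = random.Random(0)

# "Pre-synthesis" file, with calculated variables already on it.
original = [
    toy_calculate({"wages": 50000.0, "interest": 200.0}),
    toy_calculate({"wages": 82000.0, "interest": 1500.0}),
]

# Synthesize only the elemental variables, then recalculate.
synthetic = [toy_calculate(rec) for rec in toy_synthesize(original, rng)]

for rec in synthetic:
    print(rec)
```

Because the calculated variable is recomputed after synthesis, each synthetic record is internally consistent by construction, which is the "properly balanced" property described above.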

However, I think there is another issue that came up in issue #4. @andersonfrailey said:

I think we need to be careful about synthesizing only the enhanced PUF that we use in Tax-Calculator. Many of our enhancements come after we've augmented the PUF with the CPS file and I worry that trying to synthesize the PUF after we've augmented it will negatively affect our results.

I think he meant that we need to think about what file, exactly, we want to synthesize. In other words, what is the right order of operations?

Approach A: One possible ordering is:

1. Synthesize the raw PUF obtained from SOI.
2. Augment the synthesized PUF via statistical match in a fashion similar to current augmentation (add nonfilers from CPS, add selected CPS variables).
3. Enhance the augmented-synthesized file by imputing itemizers, pension contributions, and the prime-spouse wage split (among other enhancements), producing a final releasable file (based on current processes).

Approach B: Another possible ordering is:

1. Augment the PUF from SOI via statistical match in a fashion similar to current augmentation (add nonfilers from CPS, add selected CPS variables).
2. Enhance the augmented PUF by imputing itemizers, pension contributions, and the prime-spouse wage split (among other enhancements), producing a final NON-releasable file (based on current processes).
3. Synthesize the enhanced-augmented PUF to produce a releasable file.
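The two orderings can be sketched as the same three stages composed in different orders. This is only a schematic under stated assumptions: `synthesize`, `augment`, and `enhance` are hypothetical placeholders that record which steps have run, and the `confidential` flag is a simplification that tracks only whether actual PUF records are still on the file.

```python
def synthesize(f):
    # After synthesis, the file no longer contains actual PUF records.
    return {"steps": f["steps"] + ["synthesize"], "confidential": False}


def augment(f):
    # Statistical match with CPS; confidentiality status is unchanged.
    return {"steps": f["steps"] + ["augment"], "confidential": f["confidential"]}


def enhance(f):
    # Imputations (itemizers, pensions, wage split); status is unchanged.
    return {"steps": f["steps"] + ["enhance"], "confidential": f["confidential"]}


APPROACH_A = (synthesize, augment, enhance)
APPROACH_B = (augment, enhance, synthesize)


def run(stages):
    """Apply the stages in order, returning the file produced by each stage."""
    f = {"steps": [], "confidential": True}  # raw PUF from SOI
    history = []
    for stage in stages:
        f = stage(f)
        history.append(f)
    return history


for name, stages in (("A", APPROACH_A), ("B", APPROACH_B)):
    print(f"Approach {name}:")
    for f in run(stages):
        status = "non-releasable" if f["confidential"] else "releasable"
        print("  " + " -> ".join(f["steps"]) + f"  [{status}]")
```

In this toy model, every intermediate file under Approach A is already synthetic, while under Approach B the augmented and enhanced files still contain actual PUF records and only the last file is releasable.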

@andersonfrailey, is this the kind of question you were getting at? And if so, am I correct in interpreting your comment as saying that the first approach - Approach A, synthesize before we augment and enhance - makes more sense?

If so, can you (and all of us) elaborate on the pros and cons of the two approaches (or of any alternatives)? The first approach does seem to me to have a lot of advantages:

- the synthesis task is smaller
- we don't have to try to synthesize variables for which we may not need to worry about confidentiality (although there probably are ways around this)

I do have one question, probably for @andersonfrailey: If we do Approach A, will we have all of the needed variables on the file after stage 1 (synthesis of raw PUF) to run the synthesized raw PUF through Tax-Calculator to get calculated variables, so that we can examine file quality long before we start the statistical match process? (I believe so.)
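One way to answer that question empirically would be a simple check of the stage-1 file's columns against the list of inputs the calculator needs. The sketch below assumes we can get that list from somewhere (for example, Tax-Calculator's documentation of its input variables); the names used here are made up for illustration and `missing_inputs` is a hypothetical helper.

```python
def missing_inputs(file_columns, required_inputs):
    """Return the required input variables that are absent from the file."""
    return sorted(set(required_inputs) - set(file_columns))


# Hypothetical variable names, for illustration only.
required = ["filing_status", "wages", "interest", "dividends"]
synthesized_raw_puf_columns = ["filing_status", "wages", "interest"]

missing = missing_inputs(synthesized_raw_puf_columns, required)
if missing:
    print("Cannot run the calculator yet; missing:", missing)
else:
    print("All required inputs are present.")
```

An empty result would mean the synthesized raw PUF can go straight through the calculator for quality checks before any statistical matching.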

One possible downside of the first approach is that we would start with a synthesized file early in the file creation process. Thus, we would not automatically create a "gold standard" fully merged file (actual PUF merged with CPS) unless we did another step.

Anyway, I think it would be good to discuss this.

To help me think about this, I finally did something I should have done a long time ago: I outlined the full PUF-based file creation process. The results are here, in case anyone else finds them useful (and @andersonfrailey, if you see anything you don't think is right, I would much appreciate a heads up).

donboyd5 / synpuf, issue #11