donboyd5 / synpuf

Synthetic PUF
MIT License

Which file, exactly, should we synthesize, and what is the right order of operations? #11

Open donboyd5 opened 5 years ago

donboyd5 commented 5 years ago

We closed issue #4, which was about what we should do with calculated variables. We agreed the thing to do is synthesize "elemental" variables (from which other variables may be calculated) and run the results through Tax-Calculator to get the calculated variables, giving us properly balanced tax returns. (We did not resolve whether to use the initial calculated variables on the pre-synthesis PUF as right-hand-side X variables in the synthesis and then throw them away, since of course we want the final calculated variables to be calculated from synthesized elemental variables. We agreed there are probably conceptual pros and cons to this approach, and we'll be open-minded and empirical about it.)

However, I think there is another issue that came up in issue #4. @andersonfrailey said:

I think we need to be careful about synthesizing only the enhanced PUF that we use in Tax-Calculator. Many of our enhancements come after we've augmented the PUF with the CPS file and I worry that trying to synthesize the PUF after we've augmented it will negatively affect our results.

I think he meant that we need to think about what file, exactly, we want to synthesize. In other words, what is the right order of operations?

Approach A: One possible ordering is:

  1. Synthesize the raw PUF obtained from SOI
  2. Augment the synthesized PUF via statistical match in a fashion similar to current augmentation (add nonfilers from CPS, add selected CPS variables)
  3. Enhance the augmented-synthesized file by imputing itemizers, pension contributions, and the prime-spouse wage split (among other enhancements), producing a final releasable file (based on current processes).

Approach B: Another possible ordering is:

  1. Augment the PUF from SOI via statistical match in a fashion similar to current augmentation (add nonfilers from CPS, add selected CPS variables)
  2. Enhance the augmented PUF by imputing itemizers, pension contributions, and the prime-spouse wage split (among other enhancements), producing a final NON-releasable file (based on current processes).
  3. Synthesize the enhanced-augmented PUF to produce a releasable file.
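To make the difference between the two orderings concrete, here is a minimal sketch that treats each stage as a function and only tracks whether a file still contains actual PUF records. Everything in it (`TaxFile`, `synthesize`, `augment`, `enhance`, `run`) is a hypothetical placeholder for illustration, not code from this repo or from the existing augmentation and enhancement processes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TaxFile:
    """Hypothetical stand-in for a PUF-like file; holds no real data."""
    history: Tuple[str, ...] = ()
    synthetic: bool = False  # False: still contains actual PUF records


# Placeholder stages: each one only records that it ran.
def synthesize(f: TaxFile) -> TaxFile:
    return TaxFile(f.history + ("synthesize",), True)


def augment(f: TaxFile) -> TaxFile:
    # Statistical match with CPS (nonfilers, selected CPS variables).
    return TaxFile(f.history + ("augment",), f.synthetic)


def enhance(f: TaxFile) -> TaxFile:
    # Imputations such as itemizers, pension contributions, wage split.
    return TaxFile(f.history + ("enhance",), f.synthetic)


def run(stages: List[Callable[[TaxFile], TaxFile]], start: TaxFile) -> List[TaxFile]:
    """Apply the stages in order, keeping every intermediate file."""
    files = [start]
    for stage in stages:
        files.append(stage(files[-1]))
    return files


APPROACH_A = [synthesize, augment, enhance]
APPROACH_B = [augment, enhance, synthesize]

raw_puf = TaxFile()
files_a = run(APPROACH_A, raw_puf)
files_b = run(APPROACH_B, raw_puf)

for name, files in (("A", files_a), ("B", files_b)):
    print(name, [f.synthetic for f in files])
```

Under these assumptions, every file after step 1 of Approach A is synthetic, while Approach B produces non-synthetic augmented and enhanced files along the way and only the last file is synthetic.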

@andersonfrailey, is this the kind of question you were getting at? And if so, am I correct in interpreting your comment as saying that the first approach (Approach A, synthesize before we augment and enhance) makes more sense?

If so, can you (and all of us) elaborate on the pros and cons of the two approaches (or of alternatives)? The first approach does seem to me to have a lot of advantages:

I do have one question, probably for @andersonfrailey: If we do Approach A, will we have all of the needed variables on the file after stage 1 (synthesis of raw PUF) to run the synthesized raw PUF through Tax-Calculator to get calculated variables, so that we can examine file quality long before we start the statistical match process? (I believe so.)

One possible downside of the first approach is that we would start with a synthesized file early in the file creation process. Thus, we would not automatically create a "gold standard" fully merged file (actual PUF merged with CPS) unless we did another step.

Anyway, I think it would be good to discuss this.

To help me think about this, I finally did something I should have done a long time ago, which is outline the full PUF-based file creation process. The results are here, in case anyone else finds them useful (and @andersonfrailey, if you see anything you don't think is right, I would much appreciate a heads up).

MaxGhenis commented 5 years ago

I'll cast a vote for Approach A for these reasons:

  1. It would allow us to more cleanly separate the enhancement logic from the PUF synthesis. As long as all enhancement techniques can take any PUF-like file as input, this repo's scope can be limited to creating a synthetic version of the raw PUF. That would lend itself to cleaner project management and more modular code.
  2. To the extent that any enhancements include logic to ensure correct totals, we might have to redo that logic if we synthesize those features. For example, synthesizing imputed benefits would be unlikely to reproduce the correct total participation counts that C-TAM yields now. As a result we'd probably have to add all imputations to the weight adjustment, either as pieces of the optimization function or as constraints.
  3. Computational simplicity, as you note.
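As a rough illustration of point 2, here is a minimal sketch of the simplest kind of weight adjustment that restores one participation total after synthesis. It is an assumption-laden toy, not C-TAM's method and not this project's weight adjustment: `reweight_to_target` is a hypothetical helper, the numbers are made up, and it handles a single target by ratio scaling rather than an optimization with many constraints.

```python
from typing import List


def reweight_to_target(
    weights: List[float],
    participates: List[bool],
    target_participants: float,
) -> List[float]:
    """Scale weights so weighted participation hits the target.

    Participants are scaled up or down by one factor and nonparticipants
    by another, so the total weight (population) is unchanged.
    """
    total = sum(weights)
    current = sum(w for w, p in zip(weights, participates) if p)
    if not 0 < current < total:
        raise ValueError("need both participants and nonparticipants")
    if not 0 < target_participants < total:
        raise ValueError("target must lie strictly between 0 and total weight")
    up = target_participants / current
    down = (total - target_participants) / (total - current)
    return [w * (up if p else down) for w, p in zip(weights, participates)]


# Made-up example: synthesized flags give 400 weighted participants,
# but the hypothetical administrative total is 500.
weights = [100.0, 200.0, 300.0, 400.0]
participates = [True, False, True, False]
new_weights = reweight_to_target(weights, participates, 500.0)

new_participants = sum(w for w, p in zip(new_weights, participates) if p)
print(new_participants, sum(new_weights))
```

With several imputed variables, each total would become another target of this kind, which is why they would probably need to enter the weight adjustment jointly rather than one at a time.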