donboyd5 / synpuf

Synthetic PUF
MIT License

What do we think about using calculated variables as X variables? #7

Open donboyd5 opened 5 years ago

donboyd5 commented 5 years ago

In issue #4 @feenberg said:

I would have a different procedure:

1) Synthesize all elemental variables using the calculated variables as a base. Use a mechanical application of RF or CART.

2) Recalculate the calculated variables and substitute the new values into the file.

3) Calculate a revenue score by AGI class for a small finite change in each parameter of the tax calculator using the PUF and synth.

4) Calculate the correlation between scores calculated in the two different ways.

I'd like to focus on the first two steps, which are about synthesis procedure, and set aside (in this issue) steps 3 and 4, which are about file quality evaluation.

I had thought about steps 1 and 2 but had not done it that way. I was concerned, maybe erroneously, about including too much "actual" RHS information. But as I think about it now, maybe it makes a lot of sense. I'm curious to see what @MaxGhenis thinks.

I guess it shouldn't create any new disclosure risk (if we assume that PUF records are a disclosability concern)? Maybe it's just an empirical question: we should take a look and see how well it does.
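To make steps 1 and 2 concrete, here is a minimal sketch, assuming a pandas DataFrame `puf`, scikit-learn's CART in place of whatever synthesizer we settle on, and illustrative column names; `recalculate()` is a hypothetical wrapper around Tax-Calculator:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

ELEMENTAL = ["e00200", "e00300", "e00600"]   # illustrative subset of elemental vars
CALCULATED = ["c00100", "c62100"]            # e.g. AGI, AMTI (names illustrative)

def synthesize(puf: pd.DataFrame) -> pd.DataFrame:
    """Step 1: synthesize elementals sequentially, with calculated vars as the base."""
    synth = puf[CALCULATED].copy()
    for y in ELEMENTAL:
        X = list(synth.columns)               # calculated vars + prior elementals
        tree = DecisionTreeRegressor(min_samples_leaf=5)
        tree.fit(puf[X], puf[y])
        synth[y] = tree.predict(synth[X])     # synthpop would sample within leaves instead
    return synth.drop(columns=CALCULATED)     # discard the seed columns

# Step 2: recalculate the calculated variables from the synthesized elementals.
# `recalculate` is a hypothetical wrapper around the tax calculator.
# synth = recalculate(synthesize(puf))
```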

feenberg commented 5 years ago

I suppose that any procedure that improves the quality of the synth file is a disclosure risk at some level. The advantage of using the calculated variables as a base is that the correlations with many of them are quite important in getting revenue scores right. For example, getting the correlation of dividends with interest correct is of minimal importance if the correlations with AGI are correct for both of them. By emphasizing the correlations that are important for revenue scoring, this will let us get by with worse correlations on less important pairs. And note that the correlations with AGI, AMTI, etc. are crucial; it isn't just the size of the income amount that determines the importance of getting it right.

I haven't said whether CART should be applied sequentially to the elemental values, or whether each elemental value should be synthesized only from the list of calculated values. The former might be much better; I don't know. It may not be necessary, and avoiding it would allow much smaller disclosure risk.

We are already discussing quality issues - all the graphs of CDFs of income are interesting only as one way to evaluate quality. I worry that this will lead to an endless loop of improving one measure at the expense of others, with no obvious way to decide which is better. What I like about my suggested method is that it provides a univariate comparison of methods, so it is possible to choose an unambiguous winner.

donboyd5 commented 5 years ago

I agree completely that we need to discuss measuring file quality and that we have begun doing so. Just didn't want to muddy this issue (using calculated values as X variables) with that huge question. I've opened issue #8 for that.

MaxGhenis commented 5 years ago

Re:

> 1. Synthesize all elemental variables using the calculated variables as a base. Use a mechanical application of RF or CART.

Is this saying to use something like AGI as one of the first X variables, then calculate components of that using CART/RF synthesis? Is this to ensure we're stratifying by AGI correctly, because it's such an important feature?

Calculating calculated variables post-synthesis seems most efficient to me. If the relationship between dividends and AGI is defined as a formula (given covariates), what's the value of checking its correlation? If that correlation is off, it indicates that dividends aren't correlating appropriately to other determinant(s) of AGI, and we can evaluate that directly.

feenberg commented 5 years ago

On Mon, 19 Nov 2018, Max Ghenis wrote:

> Re:
>
> 1. Synthesize all elemental variables using the calculated variables as a base. Use a mechanical application of RF or CART.
>
> Is this saying to use something like AGI as one of the first X variables, then calculate components of that using CART/RF synthesis? Is this to ensure we're stratifying by AGI correctly, because it's such an important feature?

Yes, the 27 calculated variables would be used to synthesize the others. This will give CART every chance to get the correlation with AGI right, and also (and just as important) the correlation with marginal tax rate and no-tax status right.

> Calculating calculated variables post-synthesis seems most efficient to me. If the relationship between dividends and AGI is defined as a formula (given covariates), what's the value of checking its
That doesn't sound right. CART isn't a regression of dividends on AGI plus some noise, is it? The "Classification" piece is critical.

> correlation? If that correlation is off, it indicates that dividends aren't correlating appropriately to other determinant(s) of AGI, and we can evaluate that directly.

I am confused by this. If we synthesize AGI and its components, they will not match and the return will not balance. Would we allow that? I assume we will use the tax calculator for calculated values, and I advocate using the taxpayer version of the calculated values in the synthesis.

MaxGhenis commented 5 years ago

I think we might be in agreement but talking past each other (see https://github.com/donboyd5/synpuf/issues/4#issuecomment-440339754). Can we discuss at the end of today's 2pm call?

MaxGhenis commented 5 years ago

Recapping our chat:

- We agree that we need to evaluate the result on fidelity, privacy, and computation time.
- I think we agree that seeding the algorithm with the 27 calculated variables will improve fidelity at the expense of privacy, compared to seeding with elemental variables. I think it'll also require more compute, both because we're synthesizing more variables (seeding with 27 elemental variables instead of 27 calculated variables would be 27 fewer elemental variables to synthesize), and because the models will include 27 more X variables.
- We can try both approaches to see how the benefits compare to the costs. In particular, we may want to give extra weight to fidelity between elemental variables and key calculated variables like AGI, which the seeding approach might help with.

I'd also add that evaluating against a holdout will probably show that seeding with such rich data will overfit, relative to seeding with a smaller set of elemental variables. synthpop seeds with a single variable, sampling with replacement from the training set, which I think remains worth exploring. Philosophically, I think we should consider the holdout approach to balance privacy and fidelity, though we could also consider training and comparing against the full set.
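A minimal sketch of the holdout evaluation, assuming a DataFrame `puf` and a `synthesize()` function like the sequential-CART sketch earlier in this thread; the correlation-matrix gap is just one illustrative fidelity metric:

```python
import numpy as np

# Split the PUF into train/holdout, synthesize from train only, then compare
# the synthetic file against records the synthesizer never saw.
rng = np.random.default_rng(0)
mask = rng.random(len(puf)) < 0.8
train, holdout = puf[mask], puf[~mask]

synth = synthesize(train)   # e.g. the sequential-CART sketch above

# One illustrative fidelity metric: largest gap between correlation matrices.
cols = synth.columns.intersection(holdout.columns)
gap = (synth[cols].corr() - holdout[cols].corr()).abs().to_numpy().max()
print(f"max |corr(synth) - corr(holdout)| = {gap:.3f}")
```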

feenberg commented 5 years ago

On Tue, 20 Nov 2018, Max Ghenis wrote:

> Recapping our chat:
>
> - We agree that we need to evaluate the result on fidelity, privacy, and computation time.
> - I think we agree that seeding the algorithm with the 27 calculated variables will improve fidelity at the expense of privacy, compared to seeding with elemental variables. I think it'll also require more compute, both because we're synthesizing more variables (seeding with 27 elemental variables instead of 27 calculated variables would be 27 fewer elemental variables to synthesize), and because the models will include 27 more X variables.
> - We can try both approaches to see how the benefits compare to the costs. In particular, we may want to give extra weight to fidelity between elemental variables and key calculated variables like AGI, which the seeding approach might help with.

I still don't understand. There are 200 variables in the PUF. If we seed with the 27 calculated variables, that leaves 173 variables for CART or RF to synthesize, and 27 for TaxBrain to calculate. If we seed with one elemental variable, that leaves 172 to synthesize and 27 for TaxBrain to calculate. Is that the computational difference that worries you? It seems small to me. Or is the problem that synthesis takes longer with more seed variables? That also seems to make for a small difference. The last variable to be synthesized is based on 172 prior variables if we ignore the calculated variables, or 199 if we use them as seeds. Is that the worry?

I do understand that 2^27 is a large number, and may mean that we can't use 27 seed variables, depending on how the synthesis is done. 2^10 is small compared to the number of records, though, so we ought to be able to seed with at least 10 variables. We still have to use TaxBrain for all 27 calculated variables in our released file.
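For reference, a back-of-envelope version of that cell-count arithmetic, assuming each seed variable is dichotomized (hence 2^k cells) and a training file of roughly 160,000 records (a hypothetical order of magnitude):

```python
# Number of seed cells vs. number of records, assuming binary seed variables.
n_records = 160_000   # hypothetical order of magnitude for the PUF
for k in (10, 27):
    print(f"2^{k} = {2 ** k:,} cells vs. {n_records:,} records")
# 2^10 = 1,024 cells vs. 160,000 records
# 2^27 = 134,217,728 cells vs. 160,000 records
```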

I don't think we can say much about intrusion until we have the distribution of the count of values that come from a common source record.

dan

MaxGhenis commented 5 years ago

Here's how I'm thinking about it (we could decide to seed with 1 elemental, 27, or some number in between):

| Approach | # regressions | Avg # features per regression |
| --- | --- | --- |
| 27 elemental seeds | 146 | 99.5 |
| 1 elemental seed | 172 | 86.5 |
| 27 calculated seeds | 173 | 113 |

I don't know if there's a theoretical runtime function for CART/RF, but I'm pretty sure it's worse than O(k) (where k = # features), so this could be material.
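The table's arithmetic, under the assumption (implicit above) that each regression's feature set includes the seeds plus every previously synthesized variable, so the feature count grows by one per regression:

```python
# Reproduces the table: 173 elemental variables; calculated seeds are not
# elemental, so with calculated seeds all 173 elementals still need regressions.
N_ELEMENTAL = 173

def table_row(n_seeds, seeds_are_calculated):
    n_regressions = N_ELEMENTAL - (0 if seeds_are_calculated else n_seeds)
    first, last = n_seeds, n_seeds + n_regressions - 1   # features grow by one each step
    return n_regressions, (first + last) / 2             # average # features

print(table_row(27, False))  # 27 elemental seeds  -> (146, 99.5)
print(table_row(1, False))   # 1 elemental seed    -> (172, 86.5)
print(table_row(27, True))   # 27 calculated seeds -> (173, 113.0)
```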

How does 2^seeds factor in?

donboyd5 commented 5 years ago

I am confused by this discussion. Let me elaborate and hopefully someone can straighten me out.

First, I'd like to understand the terminology.

In our example/assumption:

Do we agree that:

  1. We ALWAYS must synthesize or construct the 173 elemental variables within the overall synthesis process - we would never take them as given. However, we might construct some of the lesser variables in very simple ways, so not all variables would entail the large computing cost of something like sequential RF.
  2. We would NEVER synthesize the 27 calculated variables via any CART-like method (and probably would never even construct within synthpop) - we ALWAYS want the calculated variables on the final file to be calculated in Tax-Calculator from the synthesized/constructed elemental variables. (This is a little strong. We might want to calculate a simple variable that is easily constructed from other variables - and synthpop allows this -- if the constructed variable is a more useful RHS predictor for subsequently synthesized variables than its elements are. And this constructed variable might or might not be one of the 27 calculated concepts. But in general, we want Tax-Calculator to do the "return balancing" work.)
  3. The ONLY question in this issue is whether some of the as-provided-in-the-pre-synthesis-file calculated variables (presyn-calcvars for short) should appear on the right hand side (i.e., as exogenous predictors, to be discarded after use) in any of the analysis, and what the tradeoffs would be vis-a-vis fidelity, privacy, and computational resources.
  4. IF we include some presyn-calcvars on the RHS, they ONLY serve as predictors for elemental variables - when we're done synthesizing the elvars, we throw the presyn-calcvars away, and run the file through Tax-Calculator to get the calcvars that we want for the synthesized file.
  5. IF we include some presyn-calcvars on the RHS, it could entail substantial computing cost, depending on synthesis method - for traditional linear regression, that cost likely would be very low, but for tree methods, where each RHS variable could go into the tree and must be analyzed for splitting criteria, etc., the additional computing cost could be a large consideration (especially for sequential RF), so this is a very legitimate concern.
  6. BUT while we want to understand, as we do our work, what drives up computing costs and why, and whether there are good ways to avoid that, that's not something we have to have well-figured out in advance (unless the problem is intractable and we need another way from the start). IF we think there are sound conceptual reasons for putting presyn-calcvars on the RHS as exogenous variables, we can just go ahead and start evaluating the tradeoffs.

Are any of these statements wrong? If so I misunderstand the issue. If not, let me move to my confusion.

My confusion is that I don't understand what we mean by "seed."

It first appears in the discussion when @MaxGhenis says,

we agree that seeding the algorithm with the 27 calculated variables will improve fidelity at the expense of privacy, compared to seeding with elemental variables

It is the "compared to seeding with elemental variables part" that makes me think I don't understand.

Let me start with the terminology of synthesis in general, and as it is implemented in synthpop specifically, and ask where "seed" fits in.

Normally, in synthpop, we think of each Yi variable as a function of a vector of exogenous non-synthesized variables X, and previous Y variables already synthesized. Thus Y3 = f(Y2, Y1, X), and so on. (That's for estimation purposes. For prediction, we replace the RHS Y variables with Y-hat variables.)

In synthpop, X can be null - we can choose to have no exogenous predictors. If so, how do we predict the first Y variable? Synthpop draws randomly from its distribution. Then it estimates Y2=f(Y1), Y3=f(Y2, Y1), and so on.

That's all very clear. So when I think about what to do about presyn-calcvars, the question to me is whether we use somewhere between 0 and 27 of these as part of X. They never are part of Y. Similarly, elemental variables are always part of Y - they are never part of X. We always must synthesize every one of them. If X is null (we use 0 presyn-calcvars), then we synthesize Y1 very simply (random draws), but we still synthesize every one of them.
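In code, that scheme (null X, first Y drawn at random, later Ys modeled on earlier ones) looks roughly like the sketch below; `fit_and_draw` is a hypothetical stand-in for the per-variable CART/RF fit-and-sample step:

```python
import numpy as np
import pandas as pd

def sequential_synthesize(data: pd.DataFrame, order: list, fit_and_draw) -> pd.DataFrame:
    """X = null: draw Y1 at random, then Yi = f(Y1, ..., Y_{i-1})."""
    rng = np.random.default_rng()
    synth = pd.DataFrame(index=data.index)
    # Y1: sampled with replacement from its observed distribution
    synth[order[0]] = rng.choice(data[order[0]].to_numpy(), size=len(data))
    for i, y in enumerate(order[1:], start=1):
        prior = order[:i]   # Y1 .. Y_{i-1}, already synthesized
        # `fit_and_draw` fits on actual data, then draws conditional on the
        # synthesized priors (hypothetical stand-in for the CART/RF step).
        synth[y] = fit_and_draw(data[prior], data[y], synth[prior])
    return synth
```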

Now, back to the seed discussion. The table above outlines 3 approaches.

In the first, of the 173 elemental variables, we have 27 seeds, and 146 regressions. I just don't get this. We can't be putting any of the elemental variables into X - they are never exogenous, by definition - so we don't mean they are X variables, so what are they? And how do we manage to only need to do 146 regressions - if we have 173 elemental variables, don't we need 173 regressions? If we say that some of them are somehow unimportant and can be constructed in simple ways, that seems unrelated to the question of calculated variables.

In the second row of the table, we have 1 elemental seed. That makes me think we're not talking about it as an X variable (which it cannot be), but as the first Y. It is synthesized by random draw, so I guess we could say it is not a regression, and we only need 172 regressions. It's just that I don't understand the terminology, or how, if seeding is a random draw of the first Y, we could have 27 first-Ys in the first row of the table.

Anyway, I hope this explains my confusion.

I don't understand 2^seeds either, but that may be because I don't understand what a seed is.

I'm sorry, this wouldn't be the first time I misunderstood something fundamental, but if someone could set me straight I'd appreciate it, ideally crosswalking the seed terminology to the synthpop (and synthesis more generally) terminology of X and Y variables.

MaxGhenis commented 5 years ago

I've been thinking of "seed" as any non-regression (/CART/RF) way of synthesizing a feature. As @donboyd5 said, synthpop's default behavior "synthesizes" the first feature via sampling with replacement, then runs others as regressions. So that's one "seed" to me.

In the random forests model, I started with a similar approach, but thought it could benefit from more seeds, since random forests don't do great with just one X. So I sampled with replacement combinations of the features MARS, XTOT, and age_head, as you can see in cells 15-16 of this notebook. That is, I slimmed down the training set to those three columns, then sampled rows from that slimmed-down training set with replacement. In theory we could do this with any number of elemental features; I picked 3 arbitrarily.
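A minimal sketch of that seeding step, assuming a DataFrame `train` containing those three columns (the linked notebook is the authoritative version):

```python
import pandas as pd

SEED_COLS = ["MARS", "XTOT", "age_head"]   # the three columns from the notebook

def make_seed(train: pd.DataFrame, n: int, random_state: int = 0) -> pd.DataFrame:
    # Sample whole rows with replacement from the slimmed-down training set,
    # preserving the joint distribution of the seed columns.
    return (train[SEED_COLS]
            .sample(n=n, replace=True, random_state=random_state)
            .reset_index(drop=True))
```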

This is how I've been interpreting the proposal to seed with 27 calculated variables: slim down the training set to those 27 features, then sample rows with replacement. If the proposal is instead to take those rows as given, without sampling, it's the same idea of seeding the regressions with a minimum of 27 X variables (I'd prefer sampling with replacement as it seems more synthetic to me, but this might not matter much if we're using such a rich starting point).

So basically: synthpop has at least 1 seed; using all calculated variables as X's from the first regression onward would be 27 seeds; and if we choose to seed with elementals, we could pick any number of seeds.

donboyd5 commented 5 years ago

Thanks, @MaxGhenis, that clears it up for me.

I think the proposed idea for discussion probably was to take the 27 calculated variables (or subset) as given rather than to sample with replacement, although I don't have a strong opinion on which is better conceptually or how much difference it would make in practice.

I have long thought it makes sense to have MARS and s006 in X (I had been thinking of including them in X as given, but maybe sampled would be good) on the theory that each would give us a strong result for file quality (when judging weighted totals and distributions), and that neither should entail disclosure risk. age_head makes sense to me for the same reason. I think XTOT could, too, but @feenberg believes that SOI might think it could create disclosure risk.