donboyd5 / synpuf

Synthetic PUF
MIT License

List "base" (non-calculated) columns from PUF to synthesize #4

Closed MaxGhenis closed 5 years ago

MaxGhenis commented 5 years ago

@feenberg said in an email:

There are quite a number of variables in the PUF that are arithmetic functions of more basic variables. These include AGI, Itemized Deductions, Taxable Income, tax, etc. We can use all of these directly from the PUF as part of the synthesis procedure. There is no need to synthesize them via RF or CART.

Is there a list of such variables, so we can focus on base variables?

Having this for the CPS file as well--which is useful for testing in public on GH without worrying about disclosing data--would also be useful.

feenberg commented 5 years ago

On Fri, 16 Nov 2018, Max Ghenis wrote:

@feenberg said in an email:

  There are quite a number of variables in the PUF that are arithmetic functions of more
  basic variables. These include AGI, Itemized Deductions, Taxable Income, tax, etc. We
  can use all of these directly from the PUF as part of the synthesis procedure. There is
  no need to synthesize them via RF or CART.

Is there a list of such variables, so we can focus on base variables?

I would suggest:

E00100 AGI
E02000 Schedule E
E03260 Deduction for Self employment tax
P04470 Total Deductions
E21040 Itemized Deduction limitation
E04800 Taxable Income
E05100 Tax on taxable income
E05200 Computed Regular Tax
E05800 Income tax before credits
E06000 Income subject to tax
E06200 Marginal tax base
E06200 Tax generated
E06500 Total income tax
E08800 Income tax after credits
E10300 Total tax liability
E09600 Alternative minimum tax
E62100 Alternative minimum taxable income
E07180 Total tax credit
E06500 Total income tax
E08800 Income Tax after Credits
E10300 Total tax liability
E59680 EIC used to offset income tax before credits
E59700 EIC used to offset all other taxes
E59720 EIC refunded
E11070 Refundable Child Credit
TXRT Tax rate code

We could also use Taxsim to calculate E18425.
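A minimal sketch of how the list above could be used to separate calculated columns from base columns before synthesis. The `split_columns` helper and the example header row are hypothetical; only the variable codes come from the list.

```python
# Calculated PUF columns, taken from the list above (codes only, duplicates collapsed).
CALCULATED = {
    "E00100", "E02000", "E03260", "P04470", "E21040", "E04800", "E05100",
    "E05200", "E05800", "E06000", "E06200", "E06500", "E08800", "E10300",
    "E09600", "E62100", "E07180", "E59680", "E59700", "E59720", "E11070",
    "TXRT",
}


def split_columns(all_columns):
    """Return (base, calculated) column lists, preserving the input order."""
    base = [c for c in all_columns if c.upper() not in CALCULATED]
    calculated = [c for c in all_columns if c.upper() in CALCULATED]
    return base, calculated


# Hypothetical header row, for illustration only.
example_columns = ["RECID", "E00200", "E00300", "E00100", "E04800", "TXRT"]
base_cols, calc_cols = split_columns(example_columns)
print("base:", base_cols)
print("calculated:", calc_cols)
```

Only the base columns would go to the synthesizer; the calculated ones would be filled in afterwards.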

I do suggest that we provide some value for every variable in the PUF, rather than select the ones we think are "useful", because part of the job of the synth file is to allow users to test programs for syntax before submitting to the actual PUF. Even if we won't have a useful number for, say "Payment with return" it is important to have some number there for such checking.

We don't want to get too caught up in getting the right distribution for every variable combination - that isn't the only purpose of the file.

Dan

Having this for the CPS file as well--which is useful for testing in public on GH without worrying about disclosing data--would also be useful.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.[AHvQVap-hySsqteZkrbb35f1452bSuA2ks5uv60hgaJpZM4YnTjo.gif]

donboyd5 commented 5 years ago

This raises some interesting questions about which variables to synthesize and how.

I think at least in the near term we should focus primarily on a file that plays well with Tax-Calculator. (If there is some additional important use for the file, we should be explicit about that.) If this is the correct principle, then we should definitely synthesize all variables that are essential to the running of Tax-Calculator. Perhaps there are supplemental variables that we also want to synthesize, or construct in some other probably less sophisticated way (e.g., "populate" is a term used by TPC).

I see 5 kinds of variables:

  1. Elemental variables (i.e., variables that are used in Tax-Calculator calculations) that are in the PUF as we receive it from SOI -- for example, e00300 interest received.
  2. Elemental variables that are not in the PUF as received from SOI, but are in the enhanced PUF that is used by Tax-Calculator -- values that are constructed by @andersonfrailey and/or @martinholmer and are essential to the operation of Tax-Calculator -- I think of the split of wages between prime and spouse (e00200p and e00200s) as an example.
  3. Variables we may want to construct to make synthesis better. For example, Tax-Calculator needs (I think) both e00600 (total ordinary dividends in AGI) and e00650 (qualified dividends). We may find that it makes economic sense to construct unqualified dividends as e00600 - e00650, and then synthesize qualified and unqualified dividends separately, and add them up to get total e00600. Why? Because we might find that simple unconstrained synthesis methods can result in qualified dividends being larger than total dividends on some synthetic returns, something I think logic and law would not allow - so either we need to do something more complex to impose a constraint, or else construct the components and add them together. (Yet another approach could be to synthesize total dividends and the share of total dividends that are qualified, with that share constrained to [0, 1].) All of this could happen unknown to Tax-Calculator - we would synthesize a variable it doesn't know about (unqualified dividends) but only include in our data the variables it needs (e00600 and e00650).
  4. Variables that are calculated by Tax-Calculator from these elemental variables, such as AGI (c00100) and tax before credit (taxbc), which may or may not have counterparts in the PUF from SOI (e.g., AGI in original PUF is e00100).
  5. Variables that are in the PUF as received from SOI but are not used in Tax-Calculator and may or may not be important for other purposes. TXRT tax rate code strikes me as such a variable.
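A rough sketch of the "share" idea from type 3 above: convert total and qualified dividends to a total plus a share before synthesis, then rebuild the components afterwards with the share clipped to [0, 1]. The function names are hypothetical, the random draw is only a stand-in for a real synthesizer, and totals are assumed to be non-negative.

```python
import random


def to_share(total, qualified):
    """Before synthesis: express qualified dividends as a share of the total."""
    share = qualified / total if total > 0 else 0.0
    return total, share


def from_share(total, share):
    """After synthesis: clip the share to [0, 1] and rebuild the components."""
    share = min(1.0, max(0.0, share))
    qualified = total * share
    unqualified = total - qualified
    return total, qualified, unqualified


# Stand-in for a synthesizer: noisy shares that may fall outside [0, 1].
rng = random.Random(0)
synthetic = []
for total in [0.0, 100.0, 2500.0, 40.0]:
    noisy_share = rng.uniform(-0.2, 1.2)
    synthetic.append(from_share(total, noisy_share))

for total, qualified, unqualified in synthetic:
    print(total, round(qualified, 2), round(unqualified, 2))
```

By construction, qualified dividends can never exceed the total, and the two components always add back up to it.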

I think we should focus on synthesizing variables of types 1, 2, and 3 -- agreeing completely with @feenberg that some are more important than others, and therefore deserve more analytic attention than others. Size generally should be a good indication of degree of analytic attention needed, but not always.

We should not synthesize variables of type 4. After synthesizing types 1, 2, and 3 we should run the resulting synthetic file - which at this point should be similar to any input file to Tax-Calculator - through Tax-Calculator using tax law for the base year of the file, to get the values for variables of type 4.

As for variables of type 5, I think they should have to fight for existence. If we have a demonstrated need - or a general principle - for why a variable or set of variables should be on the file, then they could be considered. As @feenberg says, maybe tax payment should be on the file. If that is the case, perhaps some very simple post-synthesis method - e.g., x% of tax liability - should be used to "populate" the file.
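To make the "populate" idea concrete, here is a minimal sketch that fills a placeholder payment field as a fixed share of tax liability. The field names `tax_liability` and `payment` and the 10% rate are illustrative assumptions, not PUF variable names or an agreed rule.

```python
def populate_payment(records, rate=0.10):
    """Return copies of the records with a placeholder 'payment' field.

    The payment is rate * tax liability, floored at zero. The field names
    and the rate are hypothetical.
    """
    return [
        dict(r, payment=round(rate * max(r["tax_liability"], 0.0), 2))
        for r in records
    ]


# Toy records, for illustration only.
records = [
    {"recid": 1, "tax_liability": 1000.0},
    {"recid": 2, "tax_liability": -50.0},
]
populated = populate_payment(records, rate=0.10)
for r in populated:
    print(r)
```

The point is only that every record ends up with some value in the field, which is enough for syntax checking.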

If this is the approach, I think the workflow would be something like this:

  1. Synthesize a data set with elemental variables of types 1, 2 and 3 -- using more sophisticated methods for some than for others
  2. Initial evaluation to see if it passes laugh tests
  3. Run it (with type 3 dropped) through Tax-Calculator to get the calculated values for type 4 variables
  4. More-sophisticated evaluation, taking into account the values of type 4 calculated variables, which we will really care about
  5. If and when we are satisfied with step 4, we can construct those type 5 variables that justify their existence, using simple methods.

(Step 2 might be skipped or perhaps automated so that we do steps 1 and 3 before we evaluate the file seriously.)

That still leaves two questions not fully addressed: (1) how should the file constructed for Tax-Calculator relate to different PUFs -- PUF as prepared by SOI, and PUF as enhanced by @andersonfrailey?, and (2) how best to identify the type 1 and 2 elemental variables, and implicitly, the type 4 variables?

For question 1, at some point we will face a choice - should we synthesize the PUF as prepared by SOI, or the PUF as enhanced by @andersonfrailey? As the text above suggests, I think in the near term we have to synthesize the PUF as enhanced, or else we won't have a file to run through Tax-Calculator until someone enhances the file we synthesize. So that is an easy question. If and when we have a really good result, we would want to have a larger discussion about this as it would affect other workflows. Obviously it only makes sense to enhance once, so it probably always will be sensible to synthesize a post-enhancement rather than pre-enhancement PUF, but it is worth an explicit discussion at some point in the future. But the short-term answer is clear: synthesize the enhanced PUF.

For question 2, I think @feenberg's list looks good, but what we should do to be sure is start with the required inputs to Tax-Calculator and make those the synthesis variables (types 1 and 2 above), rather than trying to synthesize everything but the calculated (type 4) variables. We could look to the Tax-Calculator inputs documentation, and then verify with @andersonfrailey and/or @martinholmer.

One final note: There may be variables on the PUF, or on the CPS-based PUF, that are important beyond their Tax-Calculator need, as @feenberg mentioned. I don't know what those other needs are but I imagine one or more among us does, and it would be good to include them, too -- but I'd still suggest that they are not as important as those that are needed for Tax-Calculator.

feenberg commented 5 years ago

While I generally agree with the sentiments expressed below, I have an additional consideration that may partially contradict some of them. I do wish that whatever methodology we adopt be reproducible on a different tax year without a lot of work. That is, I would hope we would have a script or program that was sufficiently general that when the next year of data became available the script could be run against the new PUF without a research project to determine the details. This would be a great advantage, even if the quality of the imputations was reduced.

I fear that the TPC project will not be reproducible once the grant money is spent because it will take a year or more of analysis to modify the programs to another year. If this is the case, our project could be the long-lasting one, if it were easy to repeat.

Additional comments below.

On Sun, 18 Nov 2018, Don Boyd wrote:

This raises some interesting questions about which variables to synthesize and how.

I think at least in the near term we should focus primarily on a file that plays well with Tax-Calculator. (If there is some additional important use for the file, we should be explicit about that.) If this is the correct principle, then we should definitely synthesize all variables that are essential to the running of Tax-Calculator. Perhaps there are supplemental variables that we also want to synthesize, or construct in some other probably less sophisticated way (e.g., "populate" is a term used by TPC).

I do hope that we synthesize all the variables, even if we don't do a good job on all of them.

I see 5 kinds of variables:

  1. Elemental variables (i.e., variables that are used in Tax-Calculator calculations) that are in the PUF as we receive it from SOI -- for example, e00300 interest received.
  2. Elemental variables that are not in the PUF as received from SOI, but are in the enhanced PUF that is used by Tax-Calculator -- values that are constructed by @andersonfrailey and/or @martinholmer and are essential to the operation of Tax-Calculator -- I think of the split of wages between prime and spouse (e00200p and e00200s) as an example.
  3. Elemental variables we may want to construct to make synthesis better. For example, Tax-Calculator needs (I think) both e00600 (total ordinary dividends in AGI) and e00650 (qualified dividends). We may find that it makes economic sense to construct unqualified dividends as e00600 - e00650, and then synthesize qualified and unqualified dividends separately, and add them up to get total e00600. Why? Because we might find that simple unconstrained synthesis methods can result in qualified dividends being larger than total dividends on some synthetic returns, something I think logic and law would not allow - so either we need to do something more complex to impose a constraint, or else construct the components and add them together. (Yet another approach could be to synthesize total dividends and the share of total dividends that are qualified, with that share constrained to [0, 1].) All of this could happen unknown to Tax-Calculator - we would synthesize a variable it doesn't know about (unqualified dividends) but only include in our data the variables it needs (e00600 and e00650).

One more possibility - synthesize qualified and unqualified dividends, then calculate total dividends from the components. There are 2 elemental variables here and one to be calculated. Pick any 2 to synthesize. It may not even matter.

  4. Variables that are calculated by Tax-Calculator from these elemental variables, such as AGI (c00100) and tax before credit (taxbc), which may or may not have counterparts in the PUF from SOI (e.g., AGI in original PUF is e00100).

These should come from the tax calculator, not synthesized.

  5. Variables that are in the PUF as received from SOI but are not used in Tax-Calculator and may or may not be important for other purposes. TXRT tax rate code strikes me as such a variable.

Using PUF TXRT in the synthesis step may help us get the correlations between elemental variables and marginal tax rates correct, which will help scoring revenue.

I think we should focus on synthesizing variables of types 1, 2, and 3 -- agreeing completely with @feenberg that some are more important than others, and therefore deserve more analytic attention than others. Size generally should be a good indication of degree of analytic attention needed, but not always.

I do fear that too much analytical attention paid to individual variables will interfere with getting the work done quickly, and being portable to the next PUF.

We should not synthesize variables of type 4. After synthesizing types 1, 2, and 3 we should run the resulting synthetic file - which at this point should be similar to any input file to Tax-Calculator - through Tax-Calculator using tax law for the base year of the file, to get the values for variables of type 4.

Yes

As for variables of type 5, I think they should have to fight for existence. If we have a demonstrated need - or a general principle - for why a variable or set of variables should be on the file, then they could be considered. As @feenberg says, maybe tax payment should be on the file. If that is the case, perhaps some very simple post-synthesis method - e.g., x% of tax liability - should be used to "populate" the file.

Why not use CART for these?

If this is the approach, I think the workflow would be something like this:

  1. Synthesize a data set with elemental variables of types 1, 2 and 3 -- using more sophisticated methods for some than for others
  2. Initial evaluation to see if it passes laugh tests
  3. Run it (with type 3 dropped) through Tax-Calculator to get the calculated values for type 4 variables
  4. More-sophisticated evaluation, taking into account the values of type 4 calculated variables, which we will really care about
  5. If and when we are satisfied with step 4, we can construct those type 5 variables that justify their existence, using simple methods.

(Step 2 might be skipped or perhaps automated so that we do steps 1 and 3 before we evaluate the file seriously.)

I would have a different procedure:

1) Synthesize all elemental variables using the calculated variables as a base. Use a mechanical application of RF or CART.

2) Recalculate the calculated variables and substitute the new values into the file.

3) Calculate a revenue score by AGI class for a small finite change in each parameter of the tax calculator using the PUF and synth.

4) Calculate the correlation between scores calculated in the two different ways.
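Steps 3 and 4 might look roughly like the sketch below for a single parameter change. Everything here is a toy assumption: the flat-rate tax functions stand in for the tax calculator, the AGI class edges are arbitrary, and the two tiny record lists stand in for the PUF and the synthetic file.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def score_by_class(records, tax_base, tax_reform, edges):
    """Weighted revenue change (reform minus baseline) within each AGI class."""
    totals = [0.0] * (len(edges) + 1)
    for r in records:
        k = sum(r["agi"] >= e for e in edges)  # index of the AGI class
        totals[k] += r["weight"] * (tax_reform(r) - tax_base(r))
    return totals


# Toy stand-ins for the tax calculator: a flat rate, bumped by one point.
def baseline(r):
    return 0.20 * r["agi"]


def reform(r):
    return 0.21 * r["agi"]


edges = [50_000, 200_000]  # hypothetical AGI class boundaries
puf = [{"agi": 10_000, "weight": 2.0}, {"agi": 60_000, "weight": 1.0},
       {"agi": 250_000, "weight": 0.5}]
synth = [{"agi": 12_000, "weight": 2.0}, {"agi": 55_000, "weight": 1.0},
         {"agi": 240_000, "weight": 0.5}]

puf_scores = score_by_class(puf, baseline, reform, edges)
synth_scores = score_by_class(synth, baseline, reform, edges)
corr = pearson(puf_scores, synth_scores)
print(puf_scores, synth_scores, round(corr, 4))
```

In practice this would be repeated for each parameter, giving one correlation (or one pooled correlation) as a summary of how well synth reproduces the PUF's scores.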

That still leaves two questions not fully addressed: (1) how should the file constructed for Tax-Calculator relate to different PUFs -- PUF as prepared by SOI, and PUF as enhanced by @andersonfrailey?, and (2) how best to identify the type 1 and 2 elemental variables, and implicitly,

If synth has all the PUF variables with the PUF names, it is a good training dataset. If it is customized to our calculator, it isn't. If it has all the PUF variables and some more, it serves both purposes. If we don't overanalyze the problem, we can do that.

the type 4 variables?

We could simply take the CPS imputations into the PUF before the CART, or repeat the CPS imputations after the CART. Whichever is easier, I suppose. Either way, they are going to be pretty far removed from their origin, and I wouldn't have high hopes for getting representative cross-correlations. Nor should that bother us.

Dan

donboyd5 commented 5 years ago

Up above @feenberg said:

I would have a different procedure:

1) Synthesize all elemental variables using the calculated variables as a base. Use a mechanical application of RF or CART.

2) Recalculate the calculated variables and substitute the new values into the file.

3) Calculate a revenue score by AGI class for a small finite change in each parameter of the tax calculator using the PUF and synth.

4) Calculate the correlation between scores calculated in the two different ways.

I think steps 1 and 2 are important and have moved them to a separate discussion as issue #7.

andersonfrailey commented 5 years ago

I think we need to be careful about synthesizing only the enhanced PUF that we use in Tax-Calculator. Many of our enhancements come after we've augmented the PUF with the CPS file and I worry that trying to synthesize the PUF after we've augmented it will negatively affect our results.

MaxGhenis commented 5 years ago

Agreed @andersonfrailey. I'd suggest we synthesize the raw PUF, then pass that to the rest of the Tax-Calculator PUF creation procedure as the raw PUF is today.

My original question is whether any of the features in the raw PUF are derived from the rest of the raw PUF. If so we can synthesize the non-derived features, then calculate the derived features after synthesis. Without doing this we'll risk the derived features not making sense.
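A small sketch of that idea: drop whatever derived values came out of synthesis and recompute them from the synthesized base features, so each record is internally consistent. The field names and the single formula in `DERIVED` are hypothetical placeholders, not actual PUF definitions.

```python
# Hypothetical derived features, each defined as a function of base features.
DERIVED = {
    "total_income": lambda r: r["wages"] + r["interest"] + r["dividends"],
}


def recompute_derived(record):
    """Discard any derived values on the record and rebuild them from base features."""
    base = {k: v for k, v in record.items() if k not in DERIVED}
    out = dict(base)
    for name, formula in DERIVED.items():
        out[name] = formula(base)
    return out


# A synthetic record whose derived total does not match its components.
synthetic_record = {
    "wages": 50_000.0,
    "interest": 200.0,
    "dividends": 300.0,
    "total_income": 1.0,
}
fixed = recompute_derived(synthetic_record)
print(fixed)
```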

feenberg commented 5 years ago

On Tue, 20 Nov 2018, andersonfrailey wrote:

I think we need to be careful about synthesizing only the enhanced PUF that we use in Tax-Calculator. Many of our enhancements come after we've augmented the PUF with the CPS file and I worry that trying to synthesize the PUF after we've augmented it will negatively affect our results.

I don't understand. Augmentation adds variables and records. CART can't improve the result, but why should the augmented material suffer worse than the PUF data?

Dan

feenberg commented 5 years ago

On Tue, 20 Nov 2018, Max Ghenis wrote:

Agreed @andersonfrailey. I'd suggest we synthesize the raw PUF, then pass that to the rest of the Tax-Calculator PUF creation procedure as the raw PUF is today.

My original question is whether any of the features in the raw PUF are derived from the rest of the raw PUF. If so we can synthesize the non-derived features, then calculate the derived features after synthesis. Without doing this we'll risk the derived features not making sense.

If all calculated values are calculated after synthesis, then the return will balance. That seems like the way to go.

Dan

MaxGhenis commented 5 years ago

I think this is resolved based on the list @feenberg provided in https://github.com/donboyd5/synpuf/issues/4#issuecomment-439658630.

Let's discuss how to use elemental vs calculated variables in https://github.com/donboyd5/synpuf/issues/7.

And cheers to the first resolved issue in the repo! 🥇

MaxGhenis commented 5 years ago

Reopening as @andersonfrailey can provide a list of minimal columns needed from the raw PUF.

andersonfrailey commented 5 years ago

Of the 209 variables that come in the raw PUF, we only keep 68 in the PUF used by Tax-Calculator. Here's a list:

{'dsi',
 'e00200',
 'e00300',
 'e00400',
 'e00600',
 'e00650',
 'e00700',
 'e00800',
 'e00900',
 'e01100',
 'e01200',
 'e01400',
 'e01500',
 'e01700',
 'e02000',
 'e02100',
 'e02300',
 'e02400',
 'e03150',
 'e03210',
 'e03220',
 'e03230',
 'e03240',
 'e03270',
 'e03290',
 'e03300',
 'e03400',
 'e03500',
 'e07240',
 'e07260',
 'e07300',
 'e07400',
 'e07600',
 'e09700',
 'e09800',
 'e09900',
 'e11200',
 'e17500',
 'e18400',
 'e18500',
 'e19200',
 'e19800',
 'e20100',
 'e20400',
 'e24515',
 'e24518',
 'e26270',
 'e27200',
 'e32800',
 'e58990',
 'e62900',
 'e87521',
 'e87530',
 'eic',
 'f2441',
 'f6251',
 'fded',
 'flpdyr',
 'mars',
 'midr',
 'n24',
 'p08000',
 'p22250',
 'p23250',
 'p86421',
 'recid',
 's006',
 'xtot'}

There are 89 variables in the PUF used by Tax-Calculator; the rest either come from the CPS or are derived by us during file preparation.

MaxGhenis commented 5 years ago

Thanks @andersonfrailey, good to know we need to synthesize at most 67 (skipping recid). Couple questions for you:

  1. Is there value in synthesizing flpdyr, or is that just to match back to the original PUF?
  2. A couple of these aren't in the Tax-Calculator documentation, like fded and p86421. Are they used in calculating other variables in the processed PUF, or can we skip them too?

MaxGhenis commented 5 years ago

One more: just want to confirm none of these are direct transformations of others, and can be calculated without synthesis?

andersonfrailey commented 5 years ago

Is there value in synthesizing flpdyr, or is that just to match back to the original PUF?

I don't see much value in synthesizing this. We could just make this equal to whatever year of the PUF we're synthesizing in my opinion.

A couple of these aren't in the Tax-Calculator documentation, like fded and p86421. Are they used in calculating other variables in the processed PUF, or can we skip them too?

I made a mistake keeping those in the list. Neither is ultimately used in the PUF, but fded is used in puf_data/finalprep.py and p86421 is actually dropped.

One more: just want to confirm none of these are direct transformations of others, and can be calculated without synthesis?

Correct. All of the variables that can be and ultimately are calculated in Tax-Calculator are dropped.

MaxGhenis commented 5 years ago

Got it, thanks. We'll synthesize all variables listed in https://github.com/donboyd5/synpuf/issues/4#issuecomment-442866667 except recid, flpdyr, and p86421 (65 total variables).

fded is used in https://github.com/open-source-economics/taxdata/blob/master/puf_data/finalprep.py#L50, so we'll synthesize that:

  cmbtp = np.where(data['FDED'] == 1, cmbtp_itemizer, cmbtp_standard)
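As a quick check on the count, here is a sketch that starts from the 68 columns listed earlier in the thread and removes the three we agreed to skip. The `finalize` helper and its `year` argument are hypothetical; it only illustrates setting flpdyr to a constant after synthesis, as suggested above.

```python
# The 68 raw-PUF columns kept for Tax-Calculator, as listed earlier in the thread.
RAW_KEEP = {
    "dsi", "e00200", "e00300", "e00400", "e00600", "e00650", "e00700",
    "e00800", "e00900", "e01100", "e01200", "e01400", "e01500", "e01700",
    "e02000", "e02100", "e02300", "e02400", "e03150", "e03210", "e03220",
    "e03230", "e03240", "e03270", "e03290", "e03300", "e03400", "e03500",
    "e07240", "e07260", "e07300", "e07400", "e07600", "e09700", "e09800",
    "e09900", "e11200", "e17500", "e18400", "e18500", "e19200", "e19800",
    "e20100", "e20400", "e24515", "e24518", "e26270", "e27200", "e32800",
    "e58990", "e62900", "e87521", "e87530", "eic", "f2441", "f6251", "fded",
    "flpdyr", "mars", "midr", "n24", "p08000", "p22250", "p23250", "p86421",
    "recid", "s006", "xtot",
}

# Columns we will not synthesize: an identifier, a constant year, a dropped column.
SKIP = {"recid", "flpdyr", "p86421"}

TO_SYNTHESIZE = sorted(RAW_KEEP - SKIP)
print(len(RAW_KEEP), len(TO_SYNTHESIZE))


def finalize(record, year):
    """Hypothetical post-synthesis step: set flpdyr to the file's year."""
    return dict(record, flpdyr=year)
```

fded stays in the synthesis list because finalprep.py still reads it.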