PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Significant issue in running synthetic data through TaxData - seeking advice/help #330

Open donboyd5 opened 4 years ago

donboyd5 commented 4 years ago

Of interest to at least the following: @andersonfrailey, @MattHJensen, @martinholmer, @MaxGhenis, @feenberg, @marshallpku, and @gchen3

Despite well-laid plans, we have run into a significant problem running a synthetic version of the PUF through TaxData.

I think there is a short-term solution that will not require changes to TaxData and eventually could be incorporated into synthesis procedures.

My questions are: Will my proposed solution work? Is there a better short-term solution? Is there someone willing to assist in the short-term solution? @feenberg, if my workaround number 1 works, I would much appreciate your help.

Here is the problem: TaxData requires E00100. We have not synthesized E00100 and even if we did it would be wrong because it is a complex calculation based upon other variables we synthesize. It has to be calculated from those variables after they are synthesized.

This creates a circular problem:

  • We want to run the PUF-format synthetic PUF through TaxData so that it can be used in Tax-Calculator.
  • To do that, we need E00100.
  • But we cannot synthesize E00100; we must calculate it.
  • Tax-Calculator can construct the calculated counterpart to E00100, c00100 (albeit not for the desired year of 2011, but for a close substitute, 2013).
  • But Tax-Calculator won't do this with a PUF-format file. It needs a file with certain of the enhancements ordinarily done by TaxData, particularly prime-spouse splitting of E00200, E00900, and E02100.
  • (Perhaps Tax-Calculator requires other enhancements made by TaxData but I don't know that.)

Workaround number 1:

  • @feenberg would it be possible to run the synthetic no-disclosures PUF through taxsim and get AGI that way? If so we would just name it E00100, add it to the synthetic PUF file, and call it a day (for now).
  • I really hope this is the solution.

If not, then ... workaround number 2:

  • It is important to VERIFY: Why do we need E00100 in TaxData if Tax-Calculator will calculate its own version, c00100? Answering this will tell us something about how faithful the AGI calculation needs to be to the AGI that IRS would have calculated if it had constructed the synthetic PUF. I think E00100 is probably used in TaxData to help ensure that other values make sense - for example it probably is used in statistical matching against the CPS, or in determining benefit amounts, or in the targets used for reweighting. If these are the uses of E00100, then perhaps a calculation that is 95% or 98% faithful to the actual 2011 calculation (2011 is the year of our PUF) will be adequate.
  • @andersonfrailey can you elaborate on the important uses of E00100 in TaxData?
  • If my suspicion above is correct that E00100 is used for the reasons I gave, then with someone else's help, we might write a slimmed-down version of an AGI-calculation function -- or even pseudocode -- that relies only on variables found in the PUF. I assume this would be a version of AGI in calcfunctions.py in Tax-Calculator.
  • It would ideally be specific to 2011, but if need be I think in the short term it could be based on 2013. (I believe - subject to refutation by people who have looked more closely - that the changes in AGI definition between 2011 and 2013 were extremely minor.) Please keep in mind that our purpose here is testing a fully TaxData-enhanced synthetic PUF within Tax-Calculator and we need some sort of workaround to get us there. That means we can be imperfect now. This is only our first shot at it: we will eventually be synthesizing improved synthetic PUFs and as we do that we can improve our approach to this issue.
  • The problem for me is that the AGI calculation in Tax-Calculator relies on interim calculations that would take me a long time to understand. Much better if someone else could say here's how we would do it with PUF-only variables, just for 2011 (or 2013).
  • I would then write and run this function in R (or Python), pre-processing the synthetic PUF to create an acceptable E00100 before it is fed into TaxData.

I have a third suggested workaround if neither of these works, but it is more work, more complex, and would entail changes to a version of TaxData; first I think it would be good to get some reaction to these ideas.

Thoughts, anyone, on how best to proceed?

[Aside: If you are wondering how I previously tested our synthetic PUF in Tax-Calculator without running it fully through TaxData and addressing this issue, here's what I did then. I created 50/50 splits of the prime-spouse variables on all married records, for both the true PUF and the synthetic PUF, which created a file that could go through Tax-Calculator and allowed apples-to-apples comparisons of PUF and synthetic PUF. That was fine for that purpose and the current issue did not arise as we did not need E00100. But now we want a full TaxData-enhanced PUF, with realistic estimates of prime-spouse splits and with other TaxData enhancements, so we have to address this issue.]

feenberg commented 4 years ago

The tax calculator should not require E00100. Think about it. If you want to calculate the revenue from a change in the definition of AGI, then if the calculator uses E00100 from the data instead of the calculated AGI, it will get the wrong answer. This should be fixed in the calculator.

Now, if the E00100 is only used as a tab variable in the display, my argument is only a little weaker. I wonder if E00100 influences the tax calculation at all in the current code. If it did, wouldn't it have come to light as an error in some score? Maybe it is on a list of required variables, but doesn't actually influence the calculation.

Dan
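The point about reading AGI from the data versus recalculating it can be made concrete with a toy sketch. This is not Tax-Calculator or taxsim code: the AGI formula, the `wages`, `interest`, and `adjustment` fields, and the "reform" that disallows the adjustment are all hypothetical simplifications, and only the name E00100 comes from this thread.

```python
# Toy illustration only: the AGI definition and the reform are hypothetical
# and far simpler than real tax law.

def calc_agi(record, allow_adjustment=True):
    """Recompute AGI from its components under a given definition."""
    income = record["wages"] + record["interest"]
    adjustment = record["adjustment"] if allow_adjustment else 0.0
    return income - adjustment


record = {
    "wages": 50_000.0,
    "interest": 1_000.0,
    "adjustment": 2_000.0,
    "e00100": 49_000.0,  # AGI as reported in the data under current law
}

baseline_agi = calc_agi(record, allow_adjustment=True)
reform_agi = calc_agi(record, allow_adjustment=False)

# A calculator that recomputes AGI picks up the change in definition.
change_if_calculated = reform_agi - baseline_agi

# A calculator that reads AGI from the data uses the same stored value
# before and after the reform, so it sees no change.
change_if_read_from_data = record["e00100"] - record["e00100"]

print(change_if_calculated, change_if_read_from_data)
```

In this sketch the recalculated AGI moves with the reform while the stored value cannot, which is why a stored E00100 would at most be useful as a tabulation or consistency variable.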

donboyd5 commented 4 years ago

I am not saying Tax-Calculator requires E00100. (I doubt it does.) I'm saying that TaxData requires E00100.

Is taxsim able to calculate E00100 (or its counterpart c00100) for 2011 using the synthetic PUF?

If so, would you mind doing that and providing E00100 if I point you to the right synthetic PUF?

Thanks.

Don

feenberg commented 4 years ago

Sure, give me the file and I will add E00100. We could also add a few other calculated values.

dan

donboyd5 commented 4 years ago

Thanks. The file is synpuf20_no_disclosures_weighted.csv. (You've seen it before. The weight variable that one would use in analysis - not needed for this purpose - is S006_rwt.)

Regarding other variables: In my email conversation with Yimeng (@marshallpku) he said that there are 8 missing variables in the synthetic PUF. He thought that E00100 was the only truly needed one. There were 3 other dollar-valued variables that were missing, and it would be great to have them just to be safe:

  • E02500, "Social Security benefits in AGI"
  • E03260, "Deduction for self-employment tax"
  • E04800, "Taxable income"

The other missing variables were XOCAH "Exemptions for Children Living at Home"; XOCAWH "Exemptions for Children Living Away from Home"; XOODEP "Exemptions for Other Dependents"; and XOPAR "Exemptions for Parents Living at Home or Away from Home", but it does not sound like we need them (and if we did we would synthesize them rather than calculate them) - @marshallpku, can you confirm?

Don

feenberg commented 4 years ago

I have placed a file with 7 calculated variables at:

http://www.nber.org/~feenberg/synpuf20calc.zip

It is a CSV file. In addition to the requested fields I have added c09600 and c10300. Also included are flpdyr, xfpt and xfst, inferred from mars.

Lastly, I have included recid and e00200 so you can make sure that the merge with the synpuf20 goes correctly. Be sure to check.

Dan
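One way to do the merge check suggested above is sketched here with the Python standard library. It is only an illustration under stated assumptions: `read_rows` and `merge_calculated` are hypothetical helpers, column names are lowercased so that `RECID`/`E00200` in one file match `recid`/`e00200` in the other, the tolerance of 0.5 is an arbitrary allowance for rounding, and the small in-memory tables stand in for the real files.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into dicts with lowercased column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name.lower(): value for name, value in row.items()}
            for row in reader]


def merge_calculated(synpuf_rows, calc_rows, key="recid", check="e00200",
                     tol=0.5):
    """Attach calculated columns to synthetic-PUF rows, verifying the match.

    Raises ValueError if record ids are duplicated or do not line up
    one-to-one, or if the check column disagrees between the two files.
    """
    calc_by_id = {}
    for row in calc_rows:
        if row[key] in calc_by_id:
            raise ValueError(f"duplicate {key} in calculated file: {row[key]}")
        calc_by_id[row[key]] = row

    merged, seen = [], set()
    for row in synpuf_rows:
        rid = row[key]
        if rid in seen:
            raise ValueError(f"duplicate {key} in synthetic file: {rid}")
        seen.add(rid)
        if rid not in calc_by_id:
            raise ValueError(f"{key} {rid} missing from calculated file")
        calc = calc_by_id[rid]
        if abs(float(row[check]) - float(calc[check])) > tol:
            raise ValueError(f"{check} mismatch for {key} {rid}")
        # Add only the new columns; the key and check columns already exist.
        extra = {k: v for k, v in calc.items() if k not in (key, check)}
        merged.append({**row, **extra})

    if len(seen) != len(calc_by_id):
        raise ValueError("calculated file has records not in synthetic file")
    return merged


# Made-up stand-ins for the synthetic PUF and the calculated-variables file.
synpuf_csv = "RECID,E00200,E00300\n1,50000,100\n2,0,250\n"
calc_csv = "recid,e00200,e00100\n1,50000,50100\n2,0,250\n"

merged = merge_calculated(read_rows(synpuf_csv), read_rows(calc_csv))
print(merged[0])
```

Failing loudly on any mismatch is a deliberate choice here: a silently misaligned E00100 would be hard to spot once the file has gone through TaxData.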

donboyd5 commented 4 years ago

@feenberg solved our E00100 problem by running the synthetic data through taxsim and creating a file with E00100. (Thanks!)

@marshallpku was able to successfully run the synthetic file with E00100 through TaxData.

I am not closing this yet: our short-term problem is solved, but we will still want a better long-term solution. I will add more when we are closer to needing it.