donboyd5 / synpuf

Synthetic PUF
MIT License

Should we tweak variables to capture structural relationships between them? #35

Open MaxGhenis opened 5 years ago

MaxGhenis commented 5 years ago

We currently avoid synthesizing invalid relationships, such as the one between wages and EITC, by calculating variables like EITC with Tax-Calculator rather than synthesizing them.

We also transform some variables so they behave better in synthesis: we model e00600 - e00650 and e01500 - e01700 rather than e00600 and e01500, respectively. As long as the synthesized differences are non-negative, this keeps e00600 >= e00650 and e01500 >= e01700, as Tax-Calculator requires (see https://github.com/donboyd5/synpuf/issues/17).
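To make the transformation concrete, here is a minimal sketch of the round trip, assuming each record is a plain dict keyed by the Tax-Calculator variable names. The function names, the `*_minus_*` feature names, and the clipping of negative gaps at zero are illustrative assumptions, not the project's actual code.

```python
def to_synthesis_features(record):
    """Swap each total for the gap between the total and its component."""
    out = dict(record)
    out["e00600_minus_e00650"] = out.pop("e00600") - out["e00650"]
    out["e01500_minus_e01700"] = out.pop("e01500") - out["e01700"]
    return out


def from_synthesis_features(synth):
    """Rebuild the totals from a synthesized component and gap."""
    out = dict(synth)
    # Assumption: clip negative gaps at zero so the total can never
    # fall below its component, even if the synthesizer misbehaves.
    out["e00600"] = out["e00650"] + max(out.pop("e00600_minus_e00650"), 0.0)
    out["e01500"] = out["e01700"] + max(out.pop("e01500_minus_e01700"), 0.0)
    return out


original = {"e00600": 1000.0, "e00650": 400.0,
            "e01500": 2000.0, "e01700": 1500.0}
features = to_synthesis_features(original)
restored = from_synthesis_features(features)
```

Because the total is rebuilt as component plus a non-negative gap, the ordering constraint holds by construction rather than being left for the synthesizer to learn.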

This issue is to explore whether we should engineer other features to better capture relationships between variables, as in the latter example. It is motivated by a recent call with Benedetto and Stinson from Census, who recommended thinking through important structural relationships.

feenberg commented 5 years ago

On Mon, 7 Jan 2019, Max Ghenis wrote:

> We currently avoid synthesizing invalid relationships, such as the one between wages and EITC, by calculating variables like EITC with Tax-Calculator rather than synthesizing them.
>
> We also transform some variables so they behave better in synthesis: we model e00600 - e00650 and e01500 - e01700 rather than e00600 and e01500, respectively. As long as the synthesized differences are non-negative, this keeps e00600 >= e00650 and e01500 >= e01700, as Tax-Calculator requires (see #17).
>
> This issue is to explore whether we should engineer other features to better capture relationships between variables, as in the latter example.

I think there are two major places where this might be significant (though I don't know if it really is): itemization status and AMT status. Synthesis might produce too few itemized deductions to justify itemization for the appropriate number of taxpayers, and AMT income might be attributed to taxpayers with enough regular tax to avoid paying AMT. My suggestion is to synthesize extra records and substitute from among the extras for any synthetic records that can't balance.

Again, I am not sure this is really a big problem. If a taxpayer is synthesized with one deduction early in the process, then he is likely to be synthesized with other deductions. So the shortfall may be minor.

If we have some extra synthetic records, though, we can use them instead (and throw away the unused extra synthetic records).

Dan
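The substitution idea in the comment above could be sketched roughly as follows. Everything here is hypothetical: the record fields (`itemizes`, `itemized_deductions`), the consistency rule, and the 12000.0 standard deduction are placeholders chosen for illustration, not values or names from the project or from Tax-Calculator.

```python
def itemizer_is_consistent(rec, standard_deduction=12000.0):
    """Hypothetical check: a record flagged as itemizing should have
    itemized deductions above the (placeholder) standard deduction."""
    return (not rec["itemizes"]) or rec["itemized_deductions"] > standard_deduction


def fill_with_extras(primary, extras, is_consistent):
    """Keep consistent primary records; swap each inconsistent one for
    the next consistent record drawn from the extras."""
    spare = (r for r in extras if is_consistent(r))
    result = []
    for rec in primary:
        if is_consistent(rec):
            result.append(rec)
            continue
        replacement = next(spare, None)
        if replacement is None:
            raise ValueError("ran out of consistent extra records")
        result.append(replacement)
    return result  # extras that were never drawn are simply discarded


primary = [
    {"id": 1, "itemizes": True, "itemized_deductions": 20000.0},
    {"id": 2, "itemizes": True, "itemized_deductions": 3000.0},   # can't balance
    {"id": 3, "itemizes": False, "itemized_deductions": 0.0},
]
extras = [
    {"id": 101, "itemizes": True, "itemized_deductions": 500.0},  # also invalid
    {"id": 102, "itemizes": True, "itemized_deductions": 18000.0},
]
final = fill_with_extras(primary, extras, itemizer_is_consistent)
```

One design question this sketch leaves open is whether replacing a failed record with an arbitrary valid extra, rather than a similar one, would shift the synthetic distribution; that would need checking against the real data.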

MaxGhenis commented 5 years ago

Interesting, we are indeed synthesizing f6251 (Form 6251, Alternative Minimum Tax) and fded (Form of Deduction Code, itemized/standard/neither), both required by Tax-Calculator (spreadsheet). Does Tax-Calculator need these though, or could they be determined by whatever minimizes tax burden? Seems like this would be a valuable Tax-Calculator feature regardless of our project. @andersonfrailey
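As a rough illustration of "whatever minimizes tax burden", the sketch below computes tax under each deduction form and keeps the cheaper one. It is a toy under stated assumptions: `choose_deduction`, the flat-rate `toy_tax`, and all dollar amounts are hypothetical, and it ignores AMT and every other interaction a real calculation would need. It does not describe how Tax-Calculator behaves.

```python
def toy_tax(taxable_income, rate=0.25):
    """Placeholder flat tax, standing in for a real tax calculation."""
    return rate * taxable_income


def choose_deduction(income, itemized_total, standard_deduction, tax_fn=toy_tax):
    """Return (form, tax) for the deduction form with the lower tax."""
    deductions = {"itemized": itemized_total, "standard": standard_deduction}
    taxes = {
        form: tax_fn(max(income - amount, 0.0))
        for form, amount in deductions.items()
    }
    best = min(taxes, key=taxes.get)  # ties go to the first key, "itemized"
    return best, taxes[best]


form_a, tax_a = choose_deduction(50000.0, 15000.0, 12000.0)
form_b, tax_b = choose_deduction(50000.0, 4000.0, 12000.0)
```

Under this approach the deduction form becomes an output of the calculation rather than a synthesized input, so it cannot contradict the synthesized deduction amounts.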