donboyd5 / synpuf

Synthetic PUF
MIT License
4 stars 3 forks source link

Ensure e00600 >= e00650 and e01500 >= e01700 #17

Open MaxGhenis opened 5 years ago

MaxGhenis commented 5 years ago

This is a condition to run Tax-Calculator, and as reported by @donboyd5, the current synthesis throws the error:

ValueError: expression "e00600 >= e00650" is not true for every record

Don proposed two approaches:

1) Compute non-qualified dividends as nqd=e00600 - e00650, then synthesize nqd and e00650 independently, then compute the total as the sum of the two. This is the way the synthpop authors tend to do this, per my email conversation with them.

2) compute the ratio of qualified to total -- ratio=e00650 / e00600 synthesize e00600 and ratio independently compute e00650 from the two I found this to yield higher quality results, but I did not examine it rigorously.

I think we'll need more logic for each of these methods, since the model could still produce cases where nqd < 0 or ratio < 1. In particular, nqd should produce identical outcomes to the current approach of synthesizing e00600 and e00650 separately on average. So we'd probably want to amend these approaches to nqd=max(nqd, 0) (equivalent to e00600=max(e00650, e00600)) or ratio=max(ratio, 1).

That said, ideally the prediction models would figure out this relationship, so it'll be worth re-evaluating after increasing the number of trees (currently 20).

I'd suggest starting with e00600=max(e00650, e00600) as an interim approach.

donboyd5 commented 5 years ago

That sounds like a fine interim approach to me. We can do that with the synthesized file until you implement something else in the synthesis phase.

Here is a table with the distribution of the 4 variables (nqd and ratio calculated as per above) in the training (full puf) set and the first-cut synthesis. It's not a terribly common problem:

image

One simple way to impose the logic is to use the ratio approach (i.e., predict e00650 and also the ratio e00650 / e00600), fitting and predicting the ratio with CART. Because all of the ratio values computed from the training data will be in [0, 1], the fitted values also will be in [0, 1], and we won't have a problem (that's one reason I used the ratio approach).

MaxGhenis commented 5 years ago

Ah you're right, I suppose tree-based methods should make your approaches work, even if the nqd approach would be equivalent to the current behavior in linear models.

donboyd5 commented 5 years ago

I've been going through the command-line-interface documentation (https://pslmodels.github.io/Tax-Calculator/) to make sure I prepare the synthesized file properly to run through Tax-Calculator.

Per the documentation, one other variable set raises the same kind of issue as dividends. Tax-Calculator:

also expects that the value of total pension and annuity income (e01500) will be no less than the value of taxable pension and annuity income (e01700) for each filing unit

We can use the same sort of interim approach for this for now, too, adjusting the total (e01500) after synthesis, as needed. I don't think we need a separate issue for this.

The documentation also mentions splitting wages and 2 other variables between prime and spouse. I do that after synthesis, and I think we always will want that to occur in a post synthesis stage (as part of taxdata, normally).

MaxGhenis commented 5 years ago

synpuf5 and 6 fix this:

(synth.E00600 < synth.E00650).sum()  # 0
(synth.E01500 < synth.E01700).sum()  # 0

@donboyd5 could you test it out with Tax-Calculator?

FYI these files also use several more seeds for https://github.com/donboyd5/synpuf/issues/21 and use 50 rather than 20 trees (runtime near 3 hours on the full set).

donboyd5 commented 5 years ago

Yes, tomorrow.

On Fri, Dec 14, 2018 at 7:23 PM Max Ghenis notifications@github.com wrote:

synpuf5 and 6 fix this:

(synth.E00600 < synth.E00650).sum() # 0

@donboyd5 https://github.com/donboyd5 could you test it out with Tax-Calculator?

FYI these files also use several more seeds for #21 https://github.com/donboyd5/synpuf/issues/21 and use 50 rather than 20 trees (runtime near 3 hours on the full set).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/donboyd5/synpuf/issues/17#issuecomment-447517431, or mute the thread https://github.com/notifications/unsubscribe-auth/AGPEmE8cVNciHSaEyusoRe3sIu9Da-lWks5u5EEUgaJpZM4ZOC4O .

donboyd5 commented 5 years ago

synpuf in its entirety runs through Tax-Calculator without a hitch, without needing any adjustments to the variables above.