Open MaxGhenis opened 5 years ago
That sounds like a fine interim approach to me. We can do that with the synthesized file until you implement something else in the synthesis phase.
Here is a table with the distribution of the 4 variables (nqd and ratio calculated as per above) in the training (full puf) set and the first-cut synthesis. It's not a terribly common problem:
One simple way to impose the logic is to use the ratio approach (i.e., predict e00650 and also the ratio e00650 / e00600), fitting and predicting the ratio with CART. Because all of the ratio values computed from the training data will be in [0, 1], the fitted values also will be in [0, 1], and we won't have a problem (that's one reason I used the ratio approach).
Ah you're right, I suppose tree-based methods should make your approaches work, even if the nqd
approach would be equivalent to the current behavior in linear models.
I've been going through the command-line-interface documentation (https://pslmodels.github.io/Tax-Calculator/) to make sure I prepare the synthesized file properly to run through Tax-Calculator.
Per the documentation, one other variable set raises the same kind of issue as dividends. Tax-Calculator:
also expects that the value of total pension and annuity income (e01500) will be no less than the value of taxable pension and annuity income (e01700) for each filing unit
We can use the same sort of interim approach for this for now, too, adjusting the total (e01500) after synthesis, as needed. I don't think we need a separate issue for this.
The documentation also mentions splitting wages and 2 other variables between prime and spouse. I do that after synthesis, and I think we always will want that to occur in a post synthesis stage (as part of taxdata, normally).
synpuf5 and 6 fix this:
(synth.E00600 < synth.E00650).sum() # 0
(synth.E01500 < synth.E01700).sum() # 0
@donboyd5 could you test it out with Tax-Calculator?
FYI these files also use several more seeds for https://github.com/donboyd5/synpuf/issues/21 and use 50 rather than 20 trees (runtime near 3 hours on the full set).
Yes, tomorrow.
On Fri, Dec 14, 2018 at 7:23 PM Max Ghenis notifications@github.com wrote:
synpuf5 and 6 fix this:
(synth.E00600 < synth.E00650).sum() # 0
@donboyd5 https://github.com/donboyd5 could you test it out with Tax-Calculator?
FYI these files also use several more seeds for #21 https://github.com/donboyd5/synpuf/issues/21 and use 50 rather than 20 trees (runtime near 3 hours on the full set).
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/donboyd5/synpuf/issues/17#issuecomment-447517431, or mute the thread https://github.com/notifications/unsubscribe-auth/AGPEmE8cVNciHSaEyusoRe3sIu9Da-lWks5u5EEUgaJpZM4ZOC4O .
synpuf in its entirety runs through Tax-Calculator without a hitch, without needing any adjustments to the variables above.
This is a condition to run Tax-Calculator, and as reported by @donboyd5, the current synthesis throws the error:
Don proposed two approaches:
I think we'll need more logic for each of these methods, since the model could still produce cases where
nqd < 0
orratio < 1
. In particular,nqd
should produce identical outcomes to the current approach of synthesizinge00600
ande00650
separately on average. So we'd probably want to amend these approaches tonqd=max(nqd, 0)
(equivalent toe00600=max(e00650, e00600)
) orratio=max(ratio, 1)
.That said, ideally the prediction models would figure out this relationship, so it'll be worth re-evaluating after increasing the number of trees (currently 20).
I'd suggest starting with
e00600=max(e00650, e00600)
as an interim approach.