donboyd5 closed this issue 5 years ago
I expanded this issue to cover other binary and categorical features, which should be synthesized either as seeds or as strings so that the synthetic values never contain decimals. Here's my proposal for features with cardinality < 10, also captured in the pufvars Google sheet:
vname | vdesc | Cardinality | Synthesis method | Description booklet entry (as needed) |
---|---|---|---|---|
dsi | Dependent Status Indicator | 2 | Seed | Taxpayer not being claimed as a dependent on another tax return: 0; Taxpayer claimed as a dependent on another tax return: 1 |
f6251 | Form 6251, Alternative Minimum Tax | 2 | Classification | |
midr | Married Filing Separately Itemized Deductions Requirement Indicator | 2 | Classification | |
fded | Form of Deduction Code | 3 | Classification | Aggregated Return: 0; Itemized deductions: 1; Standard deduction: 2; Taxpayer did not use itemized or standard deduction: 3 |
eic | Earned Income Credit Code | 4 | Regression | No children claimed: 0; One child claimed: 1; Two children claimed: 2; Three children claimed: 3 |
f2441 | Form 2441, Child Care Credit Qualified Individual | 4 | Regression | No Form 2441 attached to return: 0; Number of qualifying individuals: 1-3 |
mars | Marital (Filing) Status | 4 | Seed | |
n24 | Number of Children for Child Tax Credit | 4 | Regression | |
xtot | Total Exemptions | 6 | Regression | |
We'll test out different specifications of seed vs. classification, so that distinction matters less right now. Does this sound right, given that all regression features will be rounded? I think capturing the ordinal nature of low-cardinality features like n24 is more important than avoiding rounding. I'm not aware of an ordinal logistic regression analogue for RF and trees, but ordinal logistic regression could be an option for linear models down the line.
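As a rough illustration of what rounding a regression-synthesized feature could look like, here is a minimal sketch that snaps each continuous prediction to the nearest allowed level, which also keeps out-of-range predictions inside the documented codes. The function name `snap_to_levels` and the sample predictions are hypothetical; the levels 0 through 3 are the eic codes listed in the table.

```python
def snap_to_levels(predictions, levels):
    """Map each continuous prediction to the closest allowed level."""
    allowed = sorted(levels)
    # min() returns the first minimum it sees, so exact ties go to the lower level.
    return [min(allowed, key=lambda level: abs(level - p)) for p in predictions]


# Hypothetical regression output for eic (codes 0-3 per the table).
raw_eic = [0.2, 1.7, 2.49, 3.6, -0.3]
snapped_eic = snap_to_levels(raw_eic, levels=[0, 1, 2, 3])
print(snapped_eic)
```

Snapping to an explicit list of levels, rather than calling a plain rounding function, means a prediction such as 3.6 or -0.3 cannot produce a code that does not exist in the source data.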
synpuf5 and synpuf6 treat every classification and seed variable in the table above as a seed variable, because the rf_synth function does not yet support classification and I still need to revise it. Other variables are rounded.
These datasets also fix https://github.com/donboyd5/synpuf/issues/17 and use 50 trees instead of 20.
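The fallback described here can be sketched as a small partitioning step. This is only an illustration under assumptions: `split_variables` and its `classification_supported` flag are hypothetical names, not part of rf_synth, and the mapping covers only the variables in the table.

```python
# Synthesis method per variable, copied from the table in this issue.
METHODS = {
    "dsi": "Seed", "f6251": "Classification", "midr": "Classification",
    "fded": "Classification", "eic": "Regression", "f2441": "Regression",
    "mars": "Seed", "n24": "Regression", "xtot": "Regression",
}


def split_variables(methods, classification_supported=False):
    """Partition variables into seeds, classified, and rounded-regression groups."""
    seeds, classified, rounded = [], [], []
    for vname, method in methods.items():
        if method == "Seed":
            seeds.append(vname)
        elif method == "Classification":
            # Until classification is supported, these fall back to seeds.
            (classified if classification_supported else seeds).append(vname)
        else:
            rounded.append(vname)
    return seeds, classified, rounded


seeds, classified, rounded = split_variables(METHODS)
print(seeds)
print(rounded)
```

Flipping `classification_supported` to `True` would move f6251, midr, and fded out of the seed list without touching the rest of the configuration.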
The synthesized MARS values are non-integer when they should be integer, and occasionally they fall far from the nearest integer. I round the values to the nearest integer.
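A minimal sketch of that rounding step, with a check for values that sit far from their nearest integer, might look like the following. The helper names, the 0.25 tolerance, and the sample values are all assumptions chosen for illustration; they do not come from the synthesis code.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, sending exact halves upward."""
    # The built-in round() sends halves to the nearest even integer instead.
    return math.floor(x + 0.5)


def round_and_flag(values, tolerance=0.25):
    """Return rounded values plus the indices that moved more than `tolerance`."""
    rounded = [round_half_up(v) for v in values]
    flagged = [i for i, (v, r) in enumerate(zip(values, rounded))
               if abs(v - r) > tolerance]
    return rounded, flagged


# Hypothetical synthesized MARS values.
raw_mars = [1.02, 2.0, 2.45, 3.9, 1.5]
rounded_mars, far_rows = round_and_flag(raw_mars)
print(rounded_mars, far_rows)
```

Keeping the flagged indices makes it possible to count how often the synthesizer lands between two filing statuses, which is one way to judge whether rounding is hiding a larger problem.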