Synthesize binary and categorical features as strings or seeds

donboyd5 commented 5 years ago

The synthesized MARS values are non-integer when they should be integer, and occasionally they fall far from the nearest integer. I round the values to the nearest integer.
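
A minimal sketch of that rounding step, in case it helps to see what "far from the nearest integer" looks like. The sample values, the `VALID_MARS` range of 1 to 4, and the helper name `round_to_valid` are all assumptions for illustration; they are not taken from the actual synthesized file.

```python
import math

# Hypothetical synthesized MARS values (illustrative only).
synth_mars = [1.02, 1.97, 2.46, 3.9, 0.6, 4.3]

# Assumed range of valid filing-status codes.
VALID_MARS = (1, 4)

def round_to_valid(x, lo=VALID_MARS[0], hi=VALID_MARS[1]):
    """Round half up, then clip to the assumed valid range."""
    nearest = math.floor(x + 0.5)  # avoids round()'s half-to-even behavior
    return min(max(nearest, lo), hi)

rounded = [round_to_valid(v) for v in synth_mars]

# Distance from each raw value to its rounded code; large values flag
# records where rounding is hiding a poor prediction.
distance = [abs(v - r) for v, r in zip(synth_mars, rounded)]

print(rounded)
print([round(d, 2) for d in distance])
```

Rounding always returns a usable code, but the `distance` list is worth checking, since a value like 2.46 is rounded to 2 even though it sits almost midway between two filing statuses.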

MaxGhenis commented 5 years ago

I expanded this issue to include other binary and categorical features, which should be synthesized either as seeds or as strings to avoid decimals. Here's my proposal for features with cardinality < 10, also captured in the pufvars Google sheet:

| vname | vdesc | Cardinality | Synthesis method | Description booklet entry (as needed) |
| --- | --- | --- | --- | --- |
| dsi | Dependent Status Indicator | 2 | Seed | Taxpayer not being claimed as a dependent on another tax return: 0; Taxpayer claimed as a dependent on another tax return: 1 |
| f6251 | Form 6251, Alternative Minimum Tax | 2 | Classification | |
| midr | Married Filing Separately Itemized Deductions Requirement Indicator | 2 | Classification | |
| fded | Form of Deduction Code | 3 | Classification | Aggregated Return: 0; Itemized deductions: 1; Standard deduction: 2; Taxpayer did not use itemized or standard deduction: 3 |
| eic | Earned Income Credit Code | 4 | Regression | No children claimed: 0; One child claimed: 1; Two children claimed: 2; Three children claimed: 3 |
| f2441 | Form 2441, Child Care Credit Qualified Individual | 4 | Regression | No Form 2441 attached to return: 0; Number of qualifying individuals: 1-3 |
| mars | Marital (Filing) Status | 4 | Seed | |
| n24 | Number of Children for Child Tax Credit | 4 | Regression | |
| xtot | Total Exemptions | 6 | Regression | |

We'll test different specifications of seed vs. classification, so that choice is less important right now. Does it sound right that all regression features will be rounded? I think capturing the ordinal nature of low-cardinality features like n24 is more important than avoiding rounding. I'm not aware of an analogue of ordinal logistic regression for random forests and trees, but ordinal logistic regression could also be an option for linear models down the line.
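
To make the regression-vs-classification tradeoff concrete, here is a rough sketch of what a single tree leaf would predict under each approach. It assumes a regression leaf predicts the mean of its training values and a classification leaf predicts the most common label; the leaf contents are made up for illustration.

```python
from statistics import mean, mode

# Hypothetical n24 values landing in one leaf (ordinal: number of children).
leaf_n24 = [0, 1, 1, 3]

# Regression: the leaf predicts the mean, which is a decimal and must be
# rounded afterwards. The result still respects the ordering of the codes.
regression_pred = mean(leaf_n24)
rounded_pred = int(regression_pred + 0.5)

# Hypothetical mars values in one leaf, stored as strings so that nothing
# downstream can average them.
leaf_mars = ["1", "1", "2", "4"]

# Classification: the leaf predicts the most common label, always a valid code.
class_pred = mode(leaf_mars)

# Averaging the same nominal codes as numbers gives a valid-looking but
# misleading answer: code 2, although code 1 is the most common in the leaf.
misleading_mean = mean(int(v) for v in leaf_mars)

print(regression_pred, rounded_pred, class_pred, misleading_mean)
```

For a count like n24 the rounded mean is a sensible middle value, while for a nominal code like mars the mean has no meaning, which is the case for treating those features as strings or seeds.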

MaxGhenis commented 5 years ago

synpuf5 and synpuf6 treat all the classification and seed variables in the table above as seed variables, because I still need to revise the rf_synth function to support classification. The other variables are rounded.
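
For readers unfamiliar with the term, a minimal sketch of seeding follows. It assumes "seed" means the variables are drawn jointly, with replacement, from the original records before the remaining features are modeled. The helper `sample_seeds` and the toy records are hypothetical; this is not the rf_synth implementation.

```python
import random

# Toy stand-in for the original file (illustrative values only).
original = [
    {"mars": 1, "dsi": 0, "wages": 40000.0},
    {"mars": 2, "dsi": 0, "wages": 85000.0},
    {"mars": 1, "dsi": 1, "wages": 6000.0},
    {"mars": 4, "dsi": 0, "wages": 32000.0},
]

def sample_seeds(records, seed_cols, n, rng):
    """Draw n records with replacement and keep only the seed columns."""
    draws = rng.choices(records, k=n)
    return [{col: rec[col] for col in seed_cols} for rec in draws]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
synth_seeds = sample_seeds(original, ["mars", "dsi"], n=10, rng=rng)

# Every synthesized (mars, dsi) pair already exists in the original data,
# so no decimals or impossible combinations can appear.
observed = {(r["mars"], r["dsi"]) for r in original}
print(all((s["mars"], s["dsi"]) in observed for s in synth_seeds))
```

Because whole rows are sampled, the joint distribution of the seed variables is preserved, at the cost of those columns being copied rather than modeled.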

These datasets also fix https://github.com/donboyd5/synpuf/issues/17 and use 50 instead of 20 trees.

donboyd5 / synpuf

Synthesize binary and categorical features as strings or seeds #21