donboyd5 / synpuf

Synthetic PUF
MIT License

Synthesize binary and categorical features as strings or seeds #21

Closed donboyd5 closed 5 years ago

donboyd5 commented 5 years ago

The synthesized MARS values are non-integer when they should be integer, and occasionally they fall far from the nearest integer. I round the values to the nearest integer.
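As a rough illustration of the rounding step, here is a minimal sketch that snaps each synthesized value to the nearest valid code, so values that land outside the valid range are also pulled back in. The function name `round_to_valid_code`, the sample values, and the assumption that MARS takes the codes 1 through 4 are illustrative only and are not taken from the synpuf code.

```python
def round_to_valid_code(value, valid_codes):
    """Return the valid code closest to value (ties go to the smaller code)."""
    return min(valid_codes, key=lambda code: (abs(code - value), code))


# Assumption: MARS is coded 1..4. Adjust to the codes actually present in the data.
MARS_CODES = [1, 2, 3, 4]

# Hypothetical synthesized values, including one below the valid range.
synthesized = [1.02, 1.97, 2.6, 3.49, 4.3, 0.4]
rounded = [round_to_valid_code(v, MARS_CODES) for v in synthesized]
print(rounded)
```

Plain rounding would turn 0.4 into 0, which is not a valid code under this assumption, so snapping to the nearest valid code is a slightly stricter variant of rounding.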

MaxGhenis commented 5 years ago

I expanded this issue to include other binary and categorical features, which should be synthesized either as seeds or as strings to avoid decimals. Here's my proposal for features with cardinality < 10, also captured in the pufvars Google sheet:

| vname | vdesc | Cardinality | Synthesis method | Description booklet entry (as needed) |
|---|---|---|---|---|
| dsi | Dependent Status Indicator | 2 | Seed | Taxpayer not being claimed as a dependent on another tax return: 0; Taxpayer claimed as a dependent on another tax return: 1 |
| f6251 | Form 6251, Alternative Minimum Tax | 2 | Classification | |
| midr | Married Filing Separately Itemized Deductions Requirement Indicator | 2 | Classification | |
| fded | Form of Deduction Code | 3 | Classification | Aggregated Return: 0; Itemized deductions: 1; Standard deduction: 2; Taxpayer did not use itemized or standard deduction: 3 |
| eic | Earned Income Credit Code | 4 | Regression | No children claimed: 0; One child claimed: 1; Two children claimed: 2; Three children claimed: 3 |
| f2441 | Form 2441, Child Care Credit Qualified Individual | 4 | Regression | No Form 2441 attached to return: 0; Number of qualifying individuals: 1-3 |
| mars | Marital (Filing) Status | 4 | Seed | |
| n24 | Number of Children for Child Tax Credit | 4 | Regression | |
| xtot | Total Exemptions | 6 | Regression | |

We'll test different specifications of seed vs. classification, so that choice is less important right now. Does this sound right, in that all regression features will be rounded? I think capturing the ordinal nature of low-cardinality features like n24 is more important than avoiding rounding. I'm not aware of an equivalent of ordinal logistic regression for RF and trees, but it could be an option for linear models down the line.
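To make the regression vs. classification trade-off concrete, here is a toy sketch of what a single tree leaf would predict under each approach. It relies only on the general facts that a regression tree leaf predicts the mean of its training targets and a classification tree leaf predicts the majority class. The function names and the leaf contents are hypothetical, and this is not the synpuf implementation.

```python
import math
from collections import Counter
from statistics import mean


def regression_then_round(leaf_values):
    """Regression leaf: predict the mean, then round half up to an integer."""
    return math.floor(mean(leaf_values) + 0.5)


def classification_vote(leaf_values):
    """Classification leaf: predict the most common value (ties go to the smaller)."""
    counts = Counter(leaf_values)
    return min(counts, key=lambda v: (-counts[v], v))


# Hypothetical n24 values of the training records that fall in one leaf.
leaf = [0, 0, 0, 3, 3]

print(regression_then_round(leaf))  # uses the ordering of the values
print(classification_vote(leaf))    # treats the values as unordered labels
```

With this leaf, the regression path returns a value between the observed ones, which respects the ordering but can produce a count that no record in the leaf has. The classification path always returns an observed value but ignores that 3 is further from 0 than 1 is.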

MaxGhenis commented 5 years ago

synpuf5 and synpuf6 use all the classification/seed variables in the above table as seed variables, because I still need to revise the rf_synth function to support classification. The other variables are rounded.
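For readers unfamiliar with the seed approach, here is a minimal sketch under the assumption that "seed" means drawing the seed columns jointly, with replacement, from the real records, so that they stay integer-valued and keep their observed combinations. The helper `sample_seed_columns`, the `wages` column, and the toy records are hypothetical and do not come from the synpuf code.

```python
import random


def sample_seed_columns(records, seed_vars, n, rng):
    """Draw n joint combinations of the seed variables, with replacement."""
    drawn = rng.choices(records, k=n)
    return [{var: row[var] for var in seed_vars} for row in drawn]


# Toy stand-in for the real file; "wages" is a made-up non-seed column.
records = [
    {"mars": 1, "dsi": 0, "wages": 52000.0},
    {"mars": 2, "dsi": 0, "wages": 98000.0},
    {"mars": 1, "dsi": 1, "wages": 4000.0},
]
SEED_VARS = ["mars", "dsi"]

synthetic_seeds = sample_seed_columns(records, SEED_VARS, n=5, rng=random.Random(0))
print(synthetic_seeds)
```

Because whole rows are drawn, every synthetic (mars, dsi) pair is one that appears in the source records. The remaining variables would then be synthesized conditional on these seeds.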

These datasets also fix https://github.com/donboyd5/synpuf/issues/17 and use 50 instead of 20 trees.