Open flaxter opened 7 years ago
I think in my pre-pummeler attempt at this I did sign(x) * log(x + 1*sign(x)) or something. log(x - min(x)) isn't shaped very nicely if min(x) is, say, -915,729,293.
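To make the shape problem concrete, here is a rough sketch in plain Python. Only the extreme minimum is taken from the comment above; the other income values are made up for illustration. With a minimum that far out, log(x - min(x)) sends every ordinary income to nearly the same number:

```python
import math

# Hypothetical incomes; only the extreme minimum comes from the comment above.
incomes = [-915_729_293, -5_000, 0, 20_000, 150_000]
lo = min(incomes)

# log(x - min(x)) is undefined at the minimum itself, so that point is skipped.
shifted = [math.log(x - lo) for x in incomes if x > lo]

# How far apart the remaining (ordinary) incomes end up after the transform.
spread = max(shifted) - min(shifted)
print(shifted)
print(spread)
```

The spread between a 5,000 loss and a 150,000 gain comes out well below 0.001 on the log scale, so nearly all of the useful variation is squashed.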
don't understand... 1+sign(x)?
I just looked through the codebook more carefully. Most (all?) of these are truncated below ("Rounded & bottom-coded") so I think something like my solution actually makes sense. Sure, it won't be a normal distribution, but if we're featurizing using KDE then it'll just have a weird bump in the lower tail. Of course my solution doesn't work when x = min(x) so I guess now I'm proposing:
log(x - min(x) + 1)
I was a little off before: what I want is sign(x) * log(|x| + 1), which maintains both sign information and magnitude information. Doing log(x - min(x) + 1) is weird because it conflates very-negative incomes with slightly-negative incomes, while the amount that moderate incomes are conflated depends on what the min is.
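A minimal sketch of the two candidates, in plain Python. The function names signed_log and shifted_log are made up for illustration, and the sample values are hypothetical. On arrays, NumPy's np.sign, np.abs and np.log1p would do the same thing elementwise.

```python
import math

def signed_log(x):
    """sign(x) * log(|x| + 1): keeps the sign, compresses the magnitude, maps 0 to 0."""
    return math.copysign(math.log1p(abs(x)), x)

def shifted_log(x, lo):
    """log(x - min(x) + 1), with the minimum passed in as lo."""
    return math.log(x - lo + 1)

# signed_log keeps very-negative and slightly-negative values apart.
print(signed_log(-1_000_000), signed_log(-10), signed_log(0), signed_log(10))

# shifted_log: the gap between two moderate incomes depends on what the minimum is.
gaps = {}
for lo in (-100, -1_000_000):
    gaps[lo] = shifted_log(50_000, lo) - shifted_log(10_000, lo)
print(gaps)
```

The gap between 10,000 and 50,000 shrinks sharply as the minimum gets more negative, which is the dependence on min(x) described above.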
OK, finally went through case-by-case using the sampled data. Here are the only two monetary variables that I found that can actually be negative:
INTP (Interest, dividends, and net rental income) has a bunch of true zeros ("None"). Only 0.2% were negative.
SEMP (Self-employment income) is the same as INTP, with even more true zeros. Again only 0.2% were negative (correlated with INTP?)

So maybe we just do categorical variables for whether INTP/SEMP are non-zero? But I still don't know what transform to use for positive / negative. Here are our two proposals, neither looks great:
Update: forgot about PERNP, which can also be negative. Or have true zeros (no earnings)?
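One way the non-zero indicator idea could look, purely as a sketch: encode_income is a hypothetical helper, the sample values are invented, and pairing the flag with sign(x) * log(|x| + 1) is just one of the options discussed in this thread, not a settled choice.

```python
import math

def encode_income(x):
    """Hypothetical helper: split one income value into (non-zero flag, signed log)."""
    nonzero = int(x != 0)
    magnitude = math.copysign(math.log1p(abs(x)), x)
    return nonzero, magnitude

# Made-up INTP-style values: a true zero, a loss, and a gain.
rows = [0, -1_200, 35_000]
features = [encode_income(x) for x in rows]
print(features)
```

The flag separates the true zeros from everything else, so the continuous feature only has to describe the non-zero cases.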
Also what does RACNUM = "Number of major race groups represented, 1..6 .Race groups" mean?
IIRC RACNUM is the flag for how many racial groups the person has indicated, with RAC1P the first race, RAC2P the second, etc.
Here are the variables I think we should log transform, all representing income/wages/etc.
Only issue is that some of these variables can be negative (for losses). So I guess the transformation for those should be x = log(x - min(x)) or something?
Once we figure that out it should be easy to put this into get_dummies.