djsutherland / pummeler

Utilities to analyze ACS PUMS files, especially for distribution regression / ecological inference
MIT License

Log transform for US$ variables #13

Open flaxter opened 7 years ago

flaxter commented 7 years ago

Here's the variables I think we should log transform, all representing income/wages/etc.

VERSIONS = {
...
    'log_transform_feats': '''INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP
                            PINCP'''.split(),

Only issue is that some of these variables can be negative (for losses). So I guess the transformation for those should be x = log(x - min(x)) or something?

Once we figure that out it should be easy to put this into get_dummies.

djsutherland commented 7 years ago

I think in my pre-pummeler attempt at this I did sign(x) * log(x + 1*sign(x)) or something. log(x - min(x)) isn't shaped very nicely if min(x) is, say, -915,729,293.

flaxter commented 7 years ago

don't understand... 1+sign(x)?

I just looked through the codebook more carefully. Most (all?) of these are truncated below ("Rounded & bottom-coded"), so I think something like my solution actually makes sense. Sure, it won't be a normal distribution, but if we're featurizing using KDE then it'll just have a weird bump in the lower tail. Of course my solution doesn't work when x = min(x), so I guess now I'm proposing:

log(x - min(x) + 1)
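
For concreteness, here is a minimal sketch of this shifted-log proposal in plain Python. The function name `shifted_log` and the sample amounts are assumptions made up for illustration (they are not pummeler code or real PUMS records); the extreme minimum is the hypothetical figure mentioned earlier in the thread. It also shows how the spacing between ordinary values depends on the sample minimum.

```python
import math

def shifted_log(values):
    """Apply log(x - min(x) + 1), taking the minimum over `values` itself."""
    lo = min(values)
    return [math.log(x - lo + 1) for x in values]

# Hypothetical amounts, not real PUMS records.
typical = [-10_000, 0, 50_000]
with_outlier = typical + [-915_729_293]

# Gap between the transformed -10,000 and the transformed 50,000 in each sample.
gap_typical = shifted_log(typical)[2] - shifted_log(typical)[0]
gap_outlier = shifted_log(with_outlier)[2] - shifted_log(with_outlier)[0]
print(gap_typical, gap_outlier)
```

With the extreme minimum in the sample, the gap between -10,000 and 50,000 all but disappears, so bottom-coding matters a lot for whether this transform behaves reasonably.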

djsutherland commented 7 years ago

I was a little off before: what I want is sign(x) * log( |x| + 1 ), which maintains both sign information and magnitude information. Doing log(x - min(x) + 1) is weird because it conflates very-negative incomes with slightly-negative incomes, while the amount that moderate incomes are conflated depends on what the min is.
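
A minimal sketch of that signed log in plain Python, for reference. The names `signed_log` and `signed_log_inverse` are hypothetical, and the inverse is an addition for illustration rather than something proposed in this thread.

```python
import math

def signed_log(x):
    """sign(x) * log(|x| + 1): odd, monotone, and maps 0 to 0."""
    return math.copysign(math.log1p(abs(x)), x)

def signed_log_inverse(y):
    """Recover x from y = signed_log(x)."""
    return math.copysign(math.expm1(abs(y)), y)

# Hypothetical amounts: losses stay negative, gains stay positive.
for x in [-10_000, -100, 0, 100, 50_000]:
    print(x, round(signed_log(x), 3))
```

Because the transform does not depend on the sample minimum, the same input always maps to the same output regardless of what else is in the data.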

flaxter commented 7 years ago

OK, finally went through case-by-case using the sampled data. Here are the only two monetary variables that I found that can actually be negative:

So maybe we just do categorical variables for whether INTP/SEMP are non-zero? But I still don't know what transform to use for positive / negative. Here are our two proposals, neither looks great:

[Images: semp, intp]

flaxter commented 7 years ago

Update: forgot about PERNP, which can also be negative. Or have true zeros (no earnings)?

Also, what does RACNUM mean? The codebook says "Number of major race groups represented", with values 1..6 labeled "Race groups".

djsutherland commented 7 years ago

IIRC RACNUM is the flag for how many racial groups the person has indicated, with RAC1P the first race, RAC2P the second, etc.