epsilon-machine / missingpy

Missing Data Imputation for Python
GNU General Public License v3.0

Impute the categorical features #21

Open renzilin opened 4 years ago

renzilin commented 4 years ago

Dear author,

Your introduction says that categorical features need to be in one-hot format. Could you please provide an example?

I'm not sure how to handle several categorical features. For instance, my data includes 2 categorical features, with 2 classes for feature1 and 3 classes for feature2. Should I turn them into a vector of 0s and 1s? For example, a vector of length 5, where the first two 0/1 entries denote feature1 (2 classes) and the 3rd to 5th entries denote feature2 (3 classes).

Thanks for your help!

Best, Zilin
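
For readers following along, the length-5 layout described in this question can be sketched in plain Python. This is only an illustration of the layout being asked about, not something confirmed by the package author; the category names ("a", "b", "x", "y", "z") and the helper functions are hypothetical, since the question only gives the class counts.

```python
# Hypothetical category names; the question only states 2 and 3 classes.
FEATURE1_CLASSES = ["a", "b"]
FEATURE2_CLASSES = ["x", "y", "z"]


def one_hot(value, classes):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in classes]


def encode_row(f1, f2):
    """Concatenate both one-hot blocks into one length-5 vector."""
    return one_hot(f1, FEATURE1_CLASSES) + one_hot(f2, FEATURE2_CLASSES)


rows = [("a", "z"), ("b", "x")]
encoded = [encode_row(f1, f2) for f1, f2 in rows]

# Columns 0-1 belong to feature1, columns 2-4 to feature2.
print(encoded)  # [[1, 0, 0, 0, 1], [0, 1, 1, 0, 0]]
```

The sketch does not cover how a missing category should be marked in this layout; the thread does not settle that point.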

oliwabartosz commented 3 years ago

I have the same question as @renzilin. A second question: how should cat_vars be used if the data has missing values in both categorical and continuous features?

Kind Regards, Bartosz

VijCodes commented 2 years ago

I had the same doubt.

nabito commented 2 years ago

Hi all, as a workaround, just try encoding the categorical variables as integers or using TargetEncoding. It seems this package has no special preprocessing for categorical variables.
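
A minimal sketch of the integer-encoding workaround, assuming missing entries are represented as None in the raw column and should stay missing (NaN) after encoding so that an imputer can still see them. The `label_encode` helper and the sample data are hypothetical and not part of missingpy.

```python
import math


def label_encode(column):
    """Map each observed category to an integer code; keep missing as NaN."""
    classes = sorted({v for v in column if v is not None})
    code_of = {c: i for i, c in enumerate(classes)}
    codes = [float(code_of[v]) if v is not None else math.nan for v in column]
    return codes, classes


# Made-up column with one missing entry.
colour = ["red", None, "blue", "red", "green"]
codes, classes = label_encode(colour)

print(classes)  # ['blue', 'green', 'red']
print(codes)    # [2.0, nan, 0.0, 2.0, 1.0]

# After imputation, an integer code can be mapped back to its label.
print(classes[int(codes[0])])  # red
```

The index of a column encoded this way is presumably what would be listed in cat_vars, so that the column is treated as categorical instead of continuous.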

brtang63 commented 1 year ago

What about encoding them into integer labels?

Actually, I'm curious why one-hot encoding the categorical variables is suggested, @ashimb9, both in the README and in another issue (https://github.com/epsilon-machine/missingpy/issues/31). I've read the following source code in missforest.py. It seems that a random forest classifier and a random forest regressor are applied to categorical and continuous variables, respectively. I don't see why dummy encoding is necessary. Also, imputation after one-hot encoding may lead to unrealistic results (more than one 1 across the columns that represent a single variable).

if self.cat_vars_ is not None and s in self.cat_vars_:
    rf_classifier.fit(X=xobs, y=yobs)
    # 7. predict ymis(s) using xmis(x)
    ymis = rf_classifier.predict(xmis)
    # 8. update imputed matrix using predicted matrix ymis(s)
    Ximp[mis_rows, s] = ymis
else:
    rf_regressor.fit(X=xobs, y=yobs)
    # 7. predict ymis(s) using xmis(x)
    ymis = rf_regressor.predict(xmis)
    # 8. update imputed matrix using predicted matrix ymis(s)
    Ximp[mis_rows, s] = ymis
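
The concern about unrealistic one-hot results can be checked after imputation with a small helper. This is a sketch under the assumption that each dummy column is imputed on its own, so nothing forces a block of dummy columns to contain exactly one 1. The function name, the block boundaries, and the data are hypothetical.

```python
def invalid_one_hot_rows(rows, block):
    """Return indices of rows whose one-hot block does not sum to exactly 1."""
    start, stop = block
    return [i for i, row in enumerate(rows) if sum(row[start:stop]) != 1]


# Made-up imputed matrix: columns 0-1 encode feature1, columns 2-4 feature2.
imputed = [
    [1, 0, 0, 0, 1],  # valid in both blocks
    [0, 1, 1, 1, 0],  # two 1s in the feature2 block
    [0, 0, 0, 1, 0],  # no 1 in the feature1 block
]

bad_f1 = invalid_one_hot_rows(imputed, (0, 2))
bad_f2 = invalid_one_hot_rows(imputed, (2, 5))

print(bad_f1)  # [2]
print(bad_f2)  # [1]
```

With integer label encoding, each variable is a single column, so this kind of inconsistency cannot occur.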