Open renzilin opened 4 years ago
I have the same question as @renzilin. A second question: how should cat_vars be used when the data have missing values in both categorical and continuous features?
Kind Regards, Bartosz
Had same doubt
Hi all, as a workaround, try encoding the categorical variables as integers or using TargetEncoding. It seems this package has no special pre-processing for categorical variables.
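The integer-encoding workaround could look roughly like the sketch below. It is plain Python and does not call missingpy; the helper names `encode_labels` and `decode_labels` and the sample column are made up for illustration. Missing entries are kept as NaN so that an imputer can still see them, and the index of the encoded column would then presumably be listed in cat_vars (that step is not shown or verified here).

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_labels(values):
    """Map each distinct non-missing label to an integer code; keep missing as NaN.

    Hypothetical helper: assumes the labels are mutually sortable (e.g. all strings).
    """
    categories = sorted({v for v in values if not _is_missing(v)})
    code_of = {cat: i for i, cat in enumerate(categories)}
    codes = [float("nan") if _is_missing(v) else float(code_of[v]) for v in values]
    return codes, categories


def decode_labels(codes, categories):
    """Map imputed codes back to labels (codes are rounded to the nearest integer)."""
    return [categories[int(round(c))] for c in codes]


# Made-up column with one missing entry.
colour = ["red", None, "blue", "red", "green"]
codes, categories = encode_labels(colour)
```

After imputation, `decode_labels` can turn the filled-in codes back into the original labels, provided the imputer returns values that correspond to existing codes.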
What about encoding them into integer labels?
Actually, I'm curious why one-hot encoding the categorical variables is suggested, @ashimb9, both in the README file and in another issue (https://github.com/epsilon-machine/missingpy/issues/31). I've read the following source code in missforest.py. It seems a random forest classifier and a random forest regressor are applied to categorical and continuous variables, respectively, so I do not see why a dummy adjustment is necessary. Also, imputation after one-hot encoding may lead to unrealistic results (more than one 1 across the columns that represent a single variable).
if self.cat_vars_ is not None and s in self.cat_vars_:
    rf_classifier.fit(X=xobs, y=yobs)
    # 7. predict ymis(s) using xmis(x)
    ymis = rf_classifier.predict(xmis)
    # 8. update imputed matrix using predicted matrix ymis(s)
    Ximp[mis_rows, s] = ymis
else:
    rf_regressor.fit(X=xobs, y=yobs)
    # 7. predict ymis(s) using xmis(x)
    ymis = rf_regressor.predict(xmis)
    # 8. update imputed matrix using predicted matrix ymis(s)
    Ximp[mis_rows, s] = ymis
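To make the concern about unrealistic one-hot results concrete, here is a small check that flags rows where the dummy columns of one variable do not sum to exactly 1. The function name `invalid_onehot_rows` and the matrix are hypothetical; the matrix is written by hand for illustration and is not actual output from missingpy.

```python
def invalid_onehot_rows(rows, group):
    """Return indices of rows whose dummy columns in `group` do not sum to exactly 1."""
    return [i for i, row in enumerate(rows) if sum(row[j] for j in group) != 1]


# Hand-written example: columns 0-2 are dummies for one 3-level variable,
# column 3 is a continuous feature.
imputed = [
    [1, 0, 0, 0.5],  # valid: exactly one dummy set
    [0, 1, 1, 1.2],  # invalid: two dummies set, so no single category
    [0, 0, 0, 0.9],  # invalid: no dummy set
    [0, 0, 1, 2.0],  # valid
]

bad = invalid_onehot_rows(imputed, group=[0, 1, 2])
```

Rows like these can arise when each dummy column is imputed on its own, because nothing forces the separate predictions to agree on one category.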
Dear author,
Your introduction says that categorical features need to be in one-hot format. Could you please provide an example?
I'm not sure how to handle several categorical features. For instance, my data include 2 categorical features, with 2 classes for feature1 and 3 classes for feature2. Should I turn them into a vector of 0s and 1s? That is, a vector of length 5, where the first two entries denote feature1 (2 classes) and the 3rd to 5th entries denote feature2 (3 classes).
Thanks for your help!
Best, Zilin
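The layout described in the question can be sketched in plain Python as follows. This only illustrates the length-5 vector being asked about; whether it matches what the README intends is for the maintainers to confirm. The helper `one_hot_row` and the class labels are made up, and a missing value is assumed to become NaN in every dummy column of its feature.

```python
import math


def one_hot_row(values, levels_per_feature):
    """Concatenate one block of 0/1 indicators per categorical feature.

    Hypothetical helper: a missing value (None) yields NaN in all of that
    feature's dummy columns, so the gap stays visible to an imputer.
    """
    row = []
    for value, levels in zip(values, levels_per_feature):
        if value is None:
            row.extend(float("nan") for _ in levels)
        else:
            row.extend(1 if value == level else 0 for level in levels)
    return row


# feature1 has 2 classes, feature2 has 3 classes (labels are illustrative).
levels = [["a", "b"], ["x", "y", "z"]]

full_row = one_hot_row(["b", "x"], levels)      # positions 0-1: feature1, 2-4: feature2
missing_row = one_hot_row([None, "z"], levels)  # feature1 is missing
```

Here `full_row` has length 5, with the first two entries encoding feature1 and the last three encoding feature2, as described in the question.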