Open mankantega opened 3 years ago
Please one-hot encode the categorical variables (also known as dummy encoding) so that they are represented as numeric variables. For example, a categorical variable 'color_name' might take one of ('red', 'blue'). A one-hot encoded version could be something like 'is_color_red' in (0, 1), where 0 means blue and 1 means red.
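For anyone landing here, a minimal sketch of that encoding using pandas (not part of missingpy). The data frame and its values are made up for illustration, and resetting the indicator to NaN for missing rows is my own assumption about what you would want before imputation, since get_dummies otherwise writes 0 there.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column with a missing entry.
df = pd.DataFrame({"color_name": ["red", "blue", None, "red"]})

# drop_first=True keeps a single indicator for a two-level variable,
# so 'blue' becomes 0 and 'red' becomes 1 in 'is_color_red'.
dummies = pd.get_dummies(
    df["color_name"], prefix="is_color", drop_first=True, dtype=float
)

# get_dummies encodes a missing value as all zeros, which would be
# indistinguishable from 'blue' here, so restore NaN for those rows.
dummies.loc[df["color_name"].isna(), :] = np.nan

print(dummies)
```

With more than two categories this produces one indicator column per remaining level, which is where the column count can grow quickly.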
@mankantega I had a similar experience. However, in my case I tried encoding the categorical variables as integers and then specifying cat_vars, and the function happily accepted it (the same goes for other types, such as boolean and object columns).
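Roughly what that integer encoding looks like, as a sketch with pandas and made-up data. The column names are hypothetical, and the commented-out call at the end reflects my understanding that cat_vars takes column indices; please check it against the missingpy docs before relying on it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
df = pd.DataFrame({
    "height": [1.2, np.nan, 3.4, 2.2],
    "color_name": ["red", "blue", None, "red"],
})

# Integer-code the categories (alphabetical order: blue=0, red=1).
codes = df["color_name"].astype("category").cat.codes.astype("float64")
codes[codes < 0] = np.nan  # cat.codes marks missing values with -1

encoded = df.assign(color_name=codes)
X = encoded.to_numpy(dtype="float64")  # every column is now float64

# Positions of the categorical columns in X.
cat_vars = [encoded.columns.get_loc("color_name")]

# Assumed usage, not run here:
# from missingpy import MissForest
# X_imputed = MissForest().fit_transform(X, cat_vars=cat_vars)
```

The point is that the codes are stored as floats, so the array is all float64 while cat_vars still records which columns are really categorical.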
@ashimb9 do you know how missingpy handles categorical variables internally? I know that many random-forest-based libraries support specialized treatment or encoding of categorical variables for best performance. Most tree-based algorithms are prone to bias or excessive tree depth when categorical variables are one-hot encoded (especially when there are many categories). I wonder why you suggest one-hot encoding, though.
@nabito It seems missingpy handles categorical variables by fitting a random forest classifier instead of a regressor; see my comment in another issue for details.
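To illustrate the idea (this is not missingpy's actual code), here is a rough sketch using scikit-learn directly. The helper impute_column is hypothetical, the data is made up, and it assumes the predictor columns have no missing values of their own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def impute_column(X, col, is_categorical, random_state=0):
    """Fill NaNs in X[:, col] from the other columns (hypothetical helper)."""
    missing = np.isnan(X[:, col])
    others = np.delete(X, col, axis=1)  # assumed to contain no NaNs

    # A classifier predicts one of the observed codes; a regressor could
    # return a fractional value that is not a valid category.
    model_cls = RandomForestClassifier if is_categorical else RandomForestRegressor
    model = model_cls(n_estimators=50, random_state=random_state)
    model.fit(others[~missing], X[~missing, col])

    X_filled = X.copy()
    X_filled[missing, col] = model.predict(others[missing])
    return X_filled


# Column 1 holds integer category codes stored as floats.
X = np.array([
    [0.0, 1.0],
    [0.1, 1.0],
    [0.9, 0.0],
    [1.0, 0.0],
    [0.95, np.nan],
])
filled = impute_column(X, col=1, is_categorical=True)
print(filled)
```

Treating the column as categorical guarantees the imputed value is one of the existing codes, which is presumably why a classifier is used for those columns.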
@ashimb9 Is integer label encoding valid for missingpy? I think it may be more appropriate than one-hot encoding in some cases. Thanks in advance.
The first thing the fit function does is check whether all the columns are float (line 440 in the missingforest.py file). What is the point of having cat_vars then? If all the columns must be float64, why and how should one specify cat_vars and actually use it?