epsilon-machine / missingpy

Missing Data Imputation for Python
GNU General Public License v3.0

Fit function takes cat_vars variable as input but does not really use it. #31

Open mankantega opened 3 years ago

mankantega commented 3 years ago

The first thing the fit function does is check that all the columns are float (line 440 of missingforest.py). What is the point of having cat_vars, then? If every column must be float64, why and how should one specify cat_vars and actually use it?

ashimb9 commented 2 years ago

Please one-hot encode (aka dummy encode) the categorical variables so that they are represented as numeric variables. For example, a categorical variable 'color_name' that takes one of ('red', 'blue') could be one-hot encoded as 'is_color_red' in (0, 1), where 1 means red and 0 means blue.
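As a rough illustration of the encoding described above, here is a minimal pure-Python sketch. The helper name `one_hot_binary` and the sample data are hypothetical and are not part of missingpy. It assumes missing entries are marked with `None` and keeps them as NaN so that they can still be imputed afterwards.

```python
import math


def one_hot_binary(values, positive):
    """Encode a two-level categorical column as 1.0 / 0.0, keeping missing values as NaN.

    Hypothetical helper for illustration only; not a missingpy function.
    """
    return [math.nan if v is None else float(v == positive) for v in values]


# Sample column with one missing entry (None).
color_name = ["red", "blue", None, "red"]

# 1.0 means red, 0.0 means blue, NaN means the value is still missing.
is_color_red = one_hot_binary(color_name, positive="red")
```

The resulting column is float-valued, which is consistent with the float check mentioned at the top of this issue.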

nabito commented 2 years ago

@mankantega I had a similar experience. In my case, however, I encoded the categorical variables as integers and then specified cat_vars, and the function happily accepted it (the same goes for other types, such as boolean and object columns).

@ashimb9 Do you know how missingpy handles categorical variables internally? I know that many random-forest-based libraries support specialized treatment or encoding of categorical variables for best performance. With one-hot encoded categorical variables, most tree-based algorithms are prone to bias or excessive tree depth, especially when a variable has many categories. I wonder why you suggest one-hot encoding, though.
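For reference, the integer-encoding approach described in this comment might look like the sketch below. The `label_encode` and `impute_with_missforest` helpers and the sample data are hypothetical. The sketch assumes that `cat_vars` takes the column indices of the categorical variables and that `MissForest.fit` accepts it as a keyword argument, and it stores the codes as floats because of the float check mentioned at the top of this issue. Whether this is the intended usage is not confirmed in this thread.

```python
import math


def label_encode(values):
    """Map each distinct non-missing category to a float code; keep None as NaN.

    Hypothetical helper for illustration only; not a missingpy function.
    """
    categories = sorted({v for v in values if v is not None})
    code_of = {cat: float(i) for i, cat in enumerate(categories)}
    codes = [math.nan if v is None else code_of[v] for v in values]
    return codes, categories


def impute_with_missforest(rows, cat_vars):
    """Pass a float matrix plus categorical column indices to MissForest.

    Imported lazily so the encoding code below runs without these packages.
    Assumes fit() accepts cat_vars as a keyword argument.
    """
    import numpy as np
    from missingpy import MissForest

    X = np.asarray(rows, dtype=float)
    return MissForest().fit(X, cat_vars=cat_vars).transform(X)


# Sample data: one numeric column and one categorical column, each with a gap.
height = [1.7, math.nan, 1.6, 1.8]
color = ["red", "blue", None, "green"]

color_codes, color_levels = label_encode(color)

# Column 0 is numeric, column 1 holds the category codes.
rows = [list(pair) for pair in zip(height, color_codes)]
cat_vars = [1]
```

Calling `impute_with_missforest(rows, cat_vars)` would then run the imputation, and `color_levels` can be used to map imputed codes back to category names.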

brtang63 commented 1 year ago

@nabito It seems missingpy handles categorical variables by fitting a random forest classifier instead of a regressor; see my comment in another issue for details. @ashimb9 Is integer label encoding valid for missingpy? I think it may be more appropriate than one-hot encoding in some cases. Thanks in advance.