Imputation on new data fails for categorical columns if not all categories are present

jfleh commented 1 year ago

When an ImputationKernel is created from a dataframe that contains a categorical column, and new data is later imputed with that kernel, the imputation fails if not all categories are present in the new data. This happens, for example, when male/female are coded as 0 and 1 and the new data contains only females. This example code shows the issue:

```python
import pandas as pd
import numpy as np
import miceforest as mf

# Original data: column "b" ends up with the categories 3.0, 4.0 and 5.0.
df = pd.DataFrame([[1, np.nan, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]], columns=["a", "b", "c"])
# New data: column "b" only contains 3.0 and 4.0.
df1 = df.copy()
df1["b"] = [np.nan, 3, 3, 4]
df["b"] = df["b"].astype("category")
df1["b"] = df1["b"].astype("category")

mms = mf.mean_match_default.copy()
mms.set_mean_match_candidates(1)
kernel = mf.ImputationKernel(df, mean_match_scheme=mms)
kernel.mice(1)
kernel.impute_new_data(df1)
```

```
Traceback (most recent call last):
  File "", line 1, in
  File "/home/user/.local/lib/python3.11/site-packages/miceforest/ImputationKernel.py", line 1598, in impute_new_data
    assert all(
AssertionError: Column types are not the same as the original data. Check categorical columns.
```

Expected behavior is that the data is imputed without error as long as the categories in the new data are a subset of the categories in the original data.

AnotherSamWilson commented 1 year ago

You just need to make sure the column types are the same. Instead of using `col.astype('category')`, use `col.astype(gender_dtype)`, where `gender_dtype` was lifted from the original dataset.

If two categorical datatypes don't have the same categories, they are seen as different datatypes.
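A minimal pandas-only sketch of that suggestion, reusing the frames from the report above. The names `b_dtype`, `same_inferred` and `same_lifted` are illustrative and not part of miceforest; the kernel itself is left out so the snippet only shows the dtype comparison.

```python
import numpy as np
import pandas as pd

# Original data: column "b" has the categories 3.0, 4.0 and 5.0.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, 3, 4, 5], "c": [3, 4, 5, 6]})
df["b"] = df["b"].astype("category")

# New data: only 3.0 and 4.0 occur in "b".
df1 = df.copy()
df1["b"] = [np.nan, 3, 3, 4]

# astype("category") infers the categories from the values that are present,
# so the resulting dtype differs from the one in the original data.
same_inferred = df1["b"].astype("category").dtype == df["b"].dtype

# Lifting the dtype from the original column keeps all three categories.
b_dtype = df["b"].dtype
df1["b"] = df1["b"].astype(b_dtype)
same_lifted = df1["b"].dtype == df["b"].dtype

print(same_inferred, same_lifted)
```

With the lifted dtype, `df1["b"]` still declares 5.0 as a category even though no row uses it, which is what makes the two column types compare equal.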

jfleh commented 1 year ago

But this will not work when the new data contains values that were not seen in the original data. In that case I guess I will have to pre-specify the categories, or set them to the union of the categories in both datasets?

AnotherSamWilson commented 1 year ago

Yes, exactly. This is because of the mean matching. If you try to insert a value into a categorical column whose categories do not include that value, it will fail. It is up to the user to robustly define the categories.

In general, "new" categories in an inference set should be dealt with intentionally. LightGBM can produce predictions for new categories, but they are not robust.
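One way to make that handling intentional is to pre-specify the categories and check new data against them before imputing. The sketch below uses only pandas; `unseen_values` is a hypothetical helper, not a miceforest function, and the 0/1 coding follows the male/female example from the original report.

```python
import pandas as pd


def unseen_values(new_col, known_dtype):
    """Return the non-missing values of new_col that are not categories of known_dtype."""
    observed = pd.Index(new_col.dropna().unique())
    return list(observed.difference(known_dtype.categories))


# Categories are declared up front instead of being inferred from either dataset.
gender_dtype = pd.CategoricalDtype(categories=[0, 1])
train = pd.Series([0, 1, 1, 0]).astype(gender_dtype)

new_ok = pd.Series([1, 1, 1])   # a subset of the known categories
new_bad = pd.Series([1, 2, 1])  # contains a value that was never declared

unseen_ok = unseen_values(new_ok, gender_dtype)
unseen_bad = unseen_values(new_bad, gender_dtype)

# Only cast once the check comes back empty.
if not unseen_ok:
    new_ok = new_ok.astype(gender_dtype)
```

What to do when the check is not empty (reject the rows, treat the values as missing, or retrain with a wider set of categories) stays a decision for the user.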

jfleh commented 1 year ago

Thanks for the responses, makes sense. A bit unrelated, but is it possible to set the mean match candidates on a per-column basis, i.e. default to 5 candidates, but use 2 candidates if a column has fewer than 20 non-missing values?

AnotherSamWilson commented 1 year ago

Yes this is possible, see the example here
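The rule from the question can be computed with pandas alone, as in the rough sketch below. `candidates_per_column` is a hypothetical helper, and how the resulting mapping is handed to miceforest is an assumption here: the linked example is the reference for the actual parameter.

```python
import numpy as np
import pandas as pd


def candidates_per_column(df, default=5, reduced=2, min_non_missing=20):
    """Map each column to a candidate count based on its number of non-missing values."""
    counts = df.notna().sum()
    return {
        col: (reduced if n < min_non_missing else default)
        for col, n in counts.items()
    }


# "a" has 30 observed values, "b" only 10.
df = pd.DataFrame({
    "a": np.arange(30, dtype=float),
    "b": [1.0] * 10 + [np.nan] * 20,
})

candidates = candidates_per_column(df)
print(candidates)
```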

fucelnad commented 8 months ago

I am having the same problem. I tried to specify the categories from train and test set as suggested but I got "AssertionError: B has unused categories: coconut". How should I deal with it and why are unused categories a problem?

```python
import miceforest as mf
import pandas as pd
import numpy as np

df_train = pd.DataFrame({'A': [600, 20, np.nan, 400, 56, 75],
                         'B': ['apple', 'orange', np.nan, 'apple', 'banana', 'apple']})
df_test = pd.DataFrame({'A': [np.nan, 2, 3, 4, 5],
                        'B': [np.nan, 'apple', 'banana', 'apple', 'coconut']})

# Use the union of the train and test categories for both frames.
all_categories = pd.concat([df_train['B'], df_test['B']]).astype('category').cat.categories
df_train['B'] = df_train['B'].astype('category').cat.set_categories(all_categories)
df_test['B'] = df_test['B'].astype('category').cat.set_categories(all_categories)

kds = mf.ImputationKernel(
    df_train,
    datasets=2,
    save_all_iterations=True,
    random_state=1
)

kds.mice(2)

train_complete = kds.complete_data()
test_complete = kds.impute_new_data(df_test).complete_data()
```
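A possible way around the assertion, sketched with pandas only and not confirmed by the maintainer: take the categories from the training data alone, so that none of them is unused, and explicitly mark values that were never seen in training as missing in the test data. Whether "coconut" should really be treated as missing is a modelling decision; the names `train_dtype` and `known` are illustrative.

```python
import numpy as np
import pandas as pd

df_train = pd.DataFrame({"A": [600, 20, np.nan, 400, 56, 75],
                         "B": ["apple", "orange", np.nan, "apple", "banana", "apple"]})
df_test = pd.DataFrame({"A": [np.nan, 2, 3, 4, 5],
                        "B": [np.nan, "apple", "banana", "apple", "coconut"]})

# Infer the categories from the training values only, so every declared
# category occurs at least once in the training data.
df_train["B"] = df_train["B"].astype("category")
train_dtype = df_train["B"].dtype

# Values unseen in training ("coconut") are set to missing on purpose, then
# the column is cast to the training dtype so both frames share one dtype.
known = df_test["B"].isin(train_dtype.categories)
df_test["B"] = df_test["B"].where(known).astype(train_dtype)

print(df_test["B"].tolist())
```

If the union of categories has already been applied to the training column, `Series.cat.remove_unused_categories()` drops the categories that no training row uses.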

AnotherSamWilson / miceforest

Imputation on new data fails for categorical columns if not all categories are present #82