Closed jfleh closed 1 year ago
You just need to make sure the column types are the same. Instead of using col.astype('category'), use col.astype(gender_dtype), where gender_dtype was lifted from the original dataset.
If two categorical datatypes don't have the same categories, they are seen as different datatypes.
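A minimal pandas-only sketch of the point above. The data is made up for illustration, and gender_dtype is simply the dtype taken from the original column:

```python
import pandas as pd

# Illustrative data: the original set has both codes, the new set only one.
original = pd.Series([0, 1, 1, 0]).astype("category")
new = pd.Series([1, 1, 1])

# Inferring the categories from the new data alone yields a different dtype.
inferred = new.astype("category")
print(inferred.dtype == original.dtype)  # False: categories [1] vs [0, 1]

# Reusing the dtype lifted from the original data keeps the types identical.
gender_dtype = original.dtype
reused = new.astype(gender_dtype)
print(reused.dtype == original.dtype)  # True
```

The second cast works because the dtype object carries the full category list, so nothing is inferred from the new values.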
But this will not work when the new data contains values that were not seen in the original data. In that case I guess I will have to pre-specify the categories, or set them to the union of the categories in the first and second datasets?
Yes, exactly. This is because of the mean matching. If you try to insert a value into a categorical column when that category does not exist, it will fail. It is up to the user to robustly define the categories.
In general, "new" categories in an inference set should be dealt with intentionally - lightgbm can get predictions from new categories, but it is not robust.
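One way to handle new categories intentionally is to check for them before casting. The sketch below uses only pandas, with made-up fruit values. Treating unseen values as missing is just one possible policy, not something the library prescribes:

```python
import pandas as pd

# Illustrative data: 'coconut' never appears in the original column.
train = pd.Series(["apple", "orange", "banana", "apple"]).astype("category")
new = pd.Series(["apple", "coconut", "banana"])

# Values in the new data that were never seen in the original data.
unseen = set(new.dropna().unique()) - set(train.cat.categories)
print(unseen)  # {'coconut'}

# Casting to the original dtype silently turns unseen values into NaN,
# so inspect `unseen` first and decide deliberately whether that is acceptable.
new_cast = new.astype(train.dtype)
print(new_cast.isna().tolist())  # [False, True, False]
```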
Thanks for the responses, makes sense. A bit unrelated, but is it possible to set the mean match candidates on a per-column basis, i.e. default to 5 candidates, but use 2 candidates if a column has fewer than 20 non-missing values?
Yes this is possible, see the example here
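The linked example is not reproduced here. As a rough sketch of the rule described in the question, the per-column counts can be built as a plain dictionary with pandas. The helper name candidates_per_column is hypothetical. How the dictionary is passed to miceforest, and whether your version accepts one, should be checked against its documentation:

```python
import numpy as np
import pandas as pd


def candidates_per_column(df, default=5, sparse=2, min_non_missing=20):
    """Return {column: candidates}, using `sparse` for columns with few observed values."""
    counts = df.notna().sum()  # non-missing values per column
    return {
        col: (sparse if counts[col] < min_non_missing else default)
        for col in df.columns
    }


# Illustrative data: one column with 30 observed values, one with only 10.
df = pd.DataFrame({
    "dense": np.arange(30, dtype=float),
    "sparse": [1.0] * 10 + [np.nan] * 20,
})
mmc = candidates_per_column(df)
print(mmc)  # {'dense': 5, 'sparse': 2}
```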
I am having the same problem. I tried to specify the categories from train and test set as suggested but I got "AssertionError: B has unused categories: coconut". How should I deal with it and why are unused categories a problem?
import miceforest as mf
import pandas as pd
import numpy as np

df_train = pd.DataFrame({'A': [600, 20, np.nan, 400, 56, 75],
                         'B': ['apple', 'orange', np.nan, 'apple', 'banana', 'apple']})
df_test = pd.DataFrame({'A': [np.nan, 2, 3, 4, 5],
                        'B': [np.nan, 'apple', 'banana', 'apple', 'coconut']})

all_categories = pd.concat([df_train['B'], df_test['B']]).astype('category').cat.categories
df_train['B'] = df_train['B'].astype('category').cat.set_categories(all_categories)
df_test['B'] = df_test['B'].astype('category').cat.set_categories(all_categories)

kds = mf.ImputationKernel(
    df_train,
    datasets=2,
    save_all_iterations=True,
    random_state=1
)
kds.mice(2)
train_complete = kds.complete_data()
test_complete = kds.impute_new_data(df_test).complete_data()
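Judging by the assertion message, the training column declares a category ('coconut') that never occurs in the training values. The pandas-only sketch below shows how to list such declared-but-unobserved categories before building the kernel. It does not call miceforest, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Same training values as above, with the union of train and test categories applied.
train_b = pd.Series(["apple", "orange", np.nan, "apple", "banana", "apple"])
all_categories = ["apple", "banana", "coconut", "orange"]
train_b = train_b.astype("category").cat.set_categories(all_categories)

# Categories that are declared on the column but never occur in its values.
observed = train_b.cat.remove_unused_categories().cat.categories
unused = [c for c in train_b.cat.categories if c not in observed]
print(unused)  # ['coconut']
```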
When an ImputationKernel is created from a dataframe that has a categorical column, and that kernel is later used to impute new data, the imputation fails if not all categories are present in the new data. For example, this happens when male/female are coded as 0 and 1 and the new data contains only females. This example code shows the issue:
import pandas as pd
import numpy as np
import miceforest as mf

df = pd.DataFrame([[1, np.nan, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]], columns=["a", "b", "c"])
df1 = df.copy()
df1["b"] = [np.nan, 3, 3, 4]
df["b"] = df["b"].astype("category")
df1["b"] = df1["b"].astype("category")

mms = mf.mean_match_default.copy()
mms.set_mean_match_candidates(1)
kernel = mf.ImputationKernel(df, mean_match_scheme=mms)
kernel.mice(1)
kernel.impute_new_data(df1)

Traceback (most recent call last):
  File "", line 1, in
  File "/home/user/.local/lib/python3.11/site-packages/miceforest/ImputationKernel.py", line 1598, in impute_new_data
    assert all(
AssertionError: Column types are not the same as the original data. Check categorical columns.
Expected behavior: the data is imputed without error as long as the categories in the new data are a subset of the categories in the original data.