AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Potentially incorrect imputations depending on feature order in variable schema #54

Closed · tammandres closed this 2 years ago

tammandres commented 2 years ago

Hi,

I noticed an unexpected imputation behaviour that can be illustrated with the following example:

In the code below, f0 has values missing completely at random and is imputed using f1 and f2 as predictors. The imputed distribution changes depending on the order in which f1 and f2 are listed in the variable schema. That order should not matter: f1 and f2 have no missing values, and they carry no information about f0. I wonder if this could be a bug? I hope I have not overlooked anything obvious!

Thanks, Andres

Code to illustrate this:

# Imports assumed by this snippet (miceforest v5-era API)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import miceforest as mf
from miceforest import mean_match_default

# Get data with three independent features - f0, f1, f2
# f0 and f2 are Gaussian, f1 is Exponential
n = 1000
rng0 = np.random.default_rng(seed=42)
rng1 = np.random.default_rng(seed=52)
rng2 = np.random.default_rng(seed=62)
f0 = rng0.normal(loc=0, scale=1, size=n)
f1 = rng1.exponential(scale=1, size=n)
f2 = rng2.normal(loc=10, scale=1, size=n)
plt.hist(f0, alpha=0.5, label='Feature 0')
plt.hist(f1, alpha=0.5, label='Feature 1')
plt.hist(f2, alpha=0.5, label='Feature 2')
plt.legend()
plt.show()
d = np.vstack([f0,f1,f2]).transpose()
d = pd.DataFrame(d, columns=['f0', 'f1', 'f2'])
print('\nDataset:\n{}'.format(d.describe()))

# Induce missingness completely at random in feature 0
# (note: choice samples with replacement, so somewhat fewer than
#  n/2 unique rows actually go missing)
rng = np.random.default_rng(seed=72)  # assumed seed; `rng` was undefined in the original snippet
idx_mis = rng.choice(np.arange(n), int(np.floor(n / 2)))
d.iloc[idx_mis, 0] = np.nan

# Impute: f2 before f1 in schema - not good result
# Imputed values are lower than the mean
var_sch = {'f0':['f2', 'f1']}
print('\nVariable schema: {}'.format(var_sch))
kernel = mf.ImputationKernel(d, datasets=1, random_state=42, 
                             mean_match_scheme=mean_match_default,
                             variable_schema=var_sch)
kernel.mice(5)
d_imp = kernel.complete_data(0)
d_imp = pd.concat(objs=[d_imp.f0, d.f0], axis=1)
d_imp.columns = ['f0 (imputed)', 'f0']
print('Distribution of imputed feature vs unimputed feature:\n\n{}'.format(d_imp.describe()))
kernel.plot_imputed_distributions(wspace=1, hspace=0.5)

# Impute: f1 before f2 in schema - good result
var_sch = {'f0':['f1', 'f2']}
print('\nVariable schema: {}'.format(var_sch))
kernel = mf.ImputationKernel(d, datasets=1, random_state=42, 
                             mean_match_scheme=mean_match_default,
                             variable_schema=var_sch)
kernel.mice(5)
d_imp = kernel.complete_data(0)
d_imp = pd.concat(objs=[d_imp.f0, d.f0], axis=1)
d_imp.columns = ['f0 (imputed)', 'f0']
print('Distribution of imputed feature vs unimputed feature:\n\n{}'.format(d_imp.describe()))
kernel.plot_imputed_distributions(wspace=1, hspace=0.5)

Output I get on my computer (without plots):

Dataset:
                f0           f1           f2
count  1000.000000  1000.000000  1000.000000
mean     -0.028892     0.933777     9.987793
std       0.989217     0.951335     0.968862
min      -3.648413     0.000646     7.231604
25%      -0.696313     0.285916     9.253244
50%       0.006178     0.648893    10.026187
75%       0.589887     1.216675    10.637074
max       3.178854     7.129019    13.086357

Variable schema: {'f0': ['f2', 'f1']}
Distribution of imputed feature vs unimputed feature:

       f0 (imputed)          f0
count   1000.000000  599.000000
mean      -0.883467   -0.011545
std        1.337261    1.007416
min       -2.964529   -2.964529
25%       -1.861845   -0.690939
50%       -1.033285   -0.005122
75%        0.221387    0.638580
max        2.905067    2.905067

Variable schema: {'f0': ['f1', 'f2']}
Distribution of imputed feature vs unimputed feature:

       f0 (imputed)          f0
count   1000.000000  599.000000
mean       0.006233   -0.011545
std        0.977652    1.007416
min       -2.964529   -2.964529
25%       -0.654283   -0.690939
50%        0.011445   -0.005122
75%        0.637487    0.638580
max        2.905067    2.905067
AnotherSamWilson commented 2 years ago

Great catch, here's what's going on: building the training feature set sorts the features by index, but building the bachelor feature set (the rows that need imputing) doesn't. So with a non-sorted predictor list in the schema, the model is trained on columns in one order and handed them in a different order at prediction time.

I will force the variable schema to sort the predictor variables if they are passed. Good catch though.
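To see why the mismatch is silent and damaging, here is an illustrative sketch (not miceforest internals): LightGBM matches features by position when handed a raw numpy array, so swapping column order between fit and predict misaligns the inputs without raising an error. All names below are made up for the demo.

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.exponential(scale=1, size=500)
f2 = rng.normal(loc=10, scale=1, size=500)
y = 2.0 * f2 + rng.normal(0, 0.1, 500)      # target driven by f2

X_train = np.column_stack([f1, f2])          # trained column order: (f1, f2)
model = lgb.LGBMRegressor(n_estimators=50).fit(X_train, y)

same_order = model.predict(np.column_stack([f1, f2]))
swapped    = model.predict(np.column_stack([f2, f1]))  # silently misaligned
print(np.abs(same_order - swapped).mean())   # large gap: the swapped predictions are garbage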

AnotherSamWilson commented 2 years ago

I pushed a fix which you can download from this repo. I plan on pushing this to pypi later tonight. It fixes the specific problem this code was triggering; if you are still seeing problems in your real code, let me know.

tammandres commented 2 years ago

Thank you for your quick reply and help! I was trying to work around this issue by making the columns in my variable schema follow the same order as in the original dataframe. I initially stumbled on it because I used np.setdiff1d to remove the feature itself from the list of predictor variables, and np.setdiff1d sorts its result (see the snippet below). I am really glad this issue is clear before continuing my analysis of a healthcare dataset! 😅
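The pitfall in isolation, as a minimal sketch (column names follow the example above; the desired predictor order here is hypothetical):

import numpy as np

# Suppose the desired predictor order is ['f2', 'f1']:
cols = ['f0', 'f2', 'f1']

# np.setdiff1d returns a *sorted* array, silently reordering the predictors:
print(np.setdiff1d(cols, ['f0']))       # ['f1' 'f2']

# An order-preserving alternative with a list comprehension:
print([c for c in cols if c != 'f0'])   # ['f2', 'f1']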

AnotherSamWilson commented 2 years ago

For future reference, variable_schema will by default use all other columns to predict each variable that has missing values. If you are going this route, it would probably be easier to just leave variable_schema as None.
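For the example data d above, that default would behave like the explicit schema {'f0': ['f1', 'f2']}, since only f0 has missing values. A minimal sketch (same kernel arguments as the repro, schema simply omitted):

import miceforest as mf

# variable_schema=None (the default): every variable with missing values
# is predicted from all other columns.
kernel = mf.ImputationKernel(d, datasets=1, random_state=42)
kernel.mice(5)
d_imp = kernel.complete_data(0)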

tammandres commented 2 years ago

That is good to know, though I still wanted to use the schema, since I needed to exclude a subset of columns from the imputation models.