AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License
340 stars 31 forks source link

Bug in ImputationKernel: Handling of imputation_order #90

Open nmehran opened 1 week ago

nmehran commented 1 week ago

First off, thank you for maintaining such a useful library, I truly appreciate it.

A bug has been identified in the ImputationKernel class regarding the imputation_order parameter. The handling of "ascending" and "descending" orders is incorrect due to a typo and improper sorting configuration.

Problematic Code

# imputation_kernel.py
if imputation_order in ["ascending", "descending"]:
    _na_counts = {
        key: value
        for key, value in self.na_counts.items()
        if key in self.imputed_variables
    }
    self.imputation_order = list(
        Series(_na_counts).sort_values(ascending=False).index
    )
    if imputation_order == "decending":
        self.imputation_order.reverse()

Issues

Proposed Fix

Update the code as follows:

if imputation_order in ["ascending", "descending"]:
    _na_counts = {
        key: value
        for key, value in self.na_counts.items()
        if key in self.imputed_variables
    }
    self.imputation_order = list(
        Series(_na_counts).sort_values(ascending=True).index
    )
    if imputation_order == "descending":
        self.imputation_order.reverse()

Steps to Verify

Test the following code to compare imputation_order for "ascending" and "descending":

import pandas as pd
import numpy as np
import miceforest as mf
from sklearn.datasets import load_iris

# Load the Iris dataset
iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Introduce missing values
np.random.seed(42)
missing_indices = np.random.choice(iris_df.index, size=10, replace=False)
iris_df.loc[missing_indices, 'sepal length (cm)'] = np.nan
iris_df.loc[np.random.choice(iris_df.index, size=8, replace=False), 'sepal width (cm)'] = np.nan
iris_df.loc[np.random.choice(iris_df.index, size=4, replace=False), 'petal length (cm)'] = np.nan
iris_df.loc[np.random.choice(iris_df.index, size=2, replace=False), 'petal width (cm)'] = np.nan

# Initialize ImputationKernel with 'ascending' order
kernel_ascending = mf.ImputationKernel(iris_df, imputation_order='ascending')

# Initialize ImputationKernel with 'descending' order
kernel_descending = mf.ImputationKernel(iris_df, imputation_order='descending')

# Compare imputation orders
ascending_order = kernel_ascending.imputation_order
descending_order = kernel_descending.imputation_order

# Print equality comparison result
print("Are the imputation orders equal?")
print(ascending_order == descending_order)

Additional Information