david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Same category is always imputed when enough trees are grown #43

Closed AnotherSamWilson closed 2 years ago

AnotherSamWilson commented 2 years ago

See this example:

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = pd.concat(load_iris(return_X_y=True, as_frame=True), axis=1)
iris["target"] = iris["target"].astype("category")

amp_iris = iris.copy()
na_where = {}
for c in iris.columns:
    na_where[c] = sorted(np.random.choice(amp_iris.shape[0], size=25, replace=False))
    amp_iris.loc[na_where[c], c] = np.nan  # np.NaN alias was removed in NumPy 2.0

# Only class 0 was imputed
from isotree import IsolationForest
imputer = IsolationForest(
    ntrees=100,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
t = "target"
imp_iris.loc[na_where[t], t].unique()

# With fewer trees, the imputation is much more accurate
imputer = IsolationForest(
    ntrees=10,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
(imp_iris.loc[na_where[t], t] == iris.loc[na_where[t], t]).mean()

Using any number of trees over 100 caused only the first class (0) to ever be imputed. Using only 10 trees usually makes the imputation much more accurate. I tried different max_depth values, but to no avail. Are there any obvious parameters I am missing that would make the categorical imputation more accurate?
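A quick way to check for the collapse described above is to tabulate which categories were imputed and compare them against the held-out true values. The helper below is only a sketch: `imputation_report` is a hypothetical name, not part of isotree, and it assumes both inputs are equal-length iterables of category labels.

```python
from collections import Counter


def imputation_report(true_values, imputed_values):
    """Summarize imputed categories against the held-out true values.

    Hypothetical helper (not part of isotree). Accepts any two
    equal-length iterables, e.g. lists or pandas Series.
    """
    true_values = list(true_values)
    imputed_values = list(imputed_values)
    if len(true_values) != len(imputed_values):
        raise ValueError("both inputs must have the same length")
    if not true_values:
        raise ValueError("inputs must not be empty")

    # How often each category was chosen by the imputer.
    counts = Counter(imputed_values)
    # Share of imputed values that match the original ones.
    matches = sum(t == i for t, i in zip(true_values, imputed_values))
    return {
        "imputed_counts": dict(counts),
        "accuracy": matches / len(true_values),
        # True when every missing entry received the same category.
        "collapsed": len(counts) == 1,
    }
```

With the variables from the example, it could be called as `imputation_report(iris.loc[na_where[t], t], imp_iris.loc[na_where[t], t])`; a `collapsed` value of `True` corresponds to the "only class 0 was imputed" symptom.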

david-cortes commented 2 years ago

Thanks again for the bug report. There is an issue in the calculations where some numbers turn into infinity, so in the meantime it is better not to use fit_transform.

david-cortes commented 2 years ago

Should be fixed now in the latest version:

pip install -U isotree

david-cortes commented 2 years ago

And by the way, the imputation is meant to be used together with prob_pick_pooled_gain; chances are that results wouldn't be very good otherwise.
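As a rough sketch of that suggestion, the settings from the original example could be extended with prob_pick_pooled_gain. The value 1.0 below is an assumption chosen for illustration, since the thread does not name a value, and `IMPUTER_PARAMS` and `make_imputer` are hypothetical names rather than part of isotree.

```python
# Settings from the original example, plus prob_pick_pooled_gain.
IMPUTER_PARAMS = dict(
    ntrees=100,
    ndim=1,
    build_imputer=True,
    missing_action="impute",
    # Assumption: always choose splits by pooled gain. The thread only
    # says imputation is meant to be used with this parameter and gives
    # no specific value.
    prob_pick_pooled_gain=1.0,
)


def make_imputer(**overrides):
    """Build an IsolationForest imputer from IMPUTER_PARAMS.

    Hypothetical convenience wrapper; keyword arguments override the
    defaults above (e.g. make_imputer(ntrees=10)).
    """
    # Imported here so the settings can be inspected without isotree installed.
    from isotree import IsolationForest

    return IsolationForest(**{**IMPUTER_PARAMS, **overrides})
```

With the data from the first post, this would be used as `imp_iris = make_imputer().fit_transform(amp_iris)`, after upgrading to a release that contains the fix mentioned above.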