david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Error when using fit() and build_imputer=True is misleading #42

Closed (AnotherSamWilson closed 2 years ago)

AnotherSamWilson commented 2 years ago

Can IsolationForest use the .fit() method when build_imputer=True? Every attempt I make fails, but the error message I receive doesn't make it clear that this is unsupported. See the example below:

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
cali = pd.concat(fetch_california_housing(return_X_y=True, as_frame=True), axis=1)

for c in cali.columns:
    ind = np.random.choice(cali.shape[0], size=100)
    cali.loc[ind, c] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions

from isotree import IsolationForest
imputer = IsolationForest(
    build_imputer=True
)

# Works as intended
imputer.fit_transform(cali)

# Throws the error below
imputer.fit(cali)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\swilson\anaconda3\envs\impcomp\lib\site-packages\isotree\__init__.py", line 1382, in fit
    self._cpp_obj.fit_model(_get_num_dtype(X_num, sample_weights, column_weights),
  File "isotree\cpp_interface.pyx", line 855, in isotree._cpp_interface.isoforest_cpp_obj.fit_model
RuntimeError: Cannot produce missing data imputations at fit time when using sub-sampling.

The error makes me think that I just have an incorrect parameter, but I have tried different sampling parameters with no luck. The fact that .fit_transform() works with no problems makes me think that .fit() simply isn't supported.

david-cortes commented 2 years ago

Thanks for the bug report. I've pushed some small changes that should fix the issue. Could you give it a try?

pip install git+https://github.com/david-cortes/isotree.git

AnotherSamWilson commented 2 years ago

This worked in my env. Awesome, so new data can be imputed, that's great news. Thanks! Record for fastest bug resolution in history?