david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License
186 stars · 38 forks

Reproducibility problems with Extended Isolation Forest #46

Closed Harmadah closed 2 years ago

Harmadah commented 2 years ago

Hi @david-cortes, thank you for this great package.

I'm currently using isotree to fit an extended isolation forest model. My issue is the following: I created, fitted, and ran anomaly detection with an instance of IsolationForest using: (ndim=2, max_samples=int(len(data)/20), ntrees=500, ntry=1, random_seed=0, max_depth=12, missing_action="fail", coefs="normal", standardize_data=True, penalize_range=True, nthreads=2, bootstrap=False, prob_pick_pooled_gain=1)

After this I implemented the same model with the same hyperparameters in another script of mine. However, when I look at the scores after fitting this model to the same data as before, I find different values. The values are very close to the ones obtained previously but still differ. I wonder whether there is a source of randomness that I didn't control through my parameters (I thought fixing the random seed would suffice) or whether it is a real issue. Many thanks in advance for your assistance.

david-cortes commented 2 years ago

Could you provide the following information:

Harmadah commented 2 years ago

I'm using

Thank you again for your quick answer and precious help.

david-cortes commented 2 years ago

Are you able to provide an example with some irreproducible results in the same script?

david-cortes commented 2 years ago

And another question: what kind of data are you passing to it? Is it a sparse matrix?

Harmadah commented 2 years ago

I'm really sorry, but I can't provide you with the real data I'm using (due to privacy issues). In terms of shape, it is a numpy.ndarray of np.float64 with shape (204599, 2). The first column holds random continuous values in [0, 1]; the other is a boolean column consisting of discrete zeros and ones. The array is not a sparse matrix.

To verify what I was seeing, I used joblib to dump the array from my first script and load it in my second script; let's call this array array1. In the second script I tested whether array1 and array2 (the array used in the second script) were equal, and Python tells me they are (returning the boolean value True). However, when I score array1 and array2, the scores differ: I get 0.89 for the max score of array1 and 0.91 for the max score of array2 with the hyperparameters mentioned in my first comment.
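One way to rule out the arrays diverging between the equality check and the scoring call is to hash their contents immediately before scoring; equality at one point in time says nothing about later in-place modifications. A minimal sketch (the array_digest helper and the array names are illustrative, not part of isotree):

```python
import hashlib

import numpy as np

def array_digest(a):
    # Hash the raw bytes plus dtype/shape, so any in-place
    # change between the check and scoring is detected
    h = hashlib.sha256()
    h.update(str(a.dtype).encode())
    h.update(str(a.shape).encode())
    h.update(np.ascontiguousarray(a).tobytes())
    return h.hexdigest()

rng = np.random.default_rng(0)
array1 = rng.random((1000, 2))
array2 = array1.copy()

# Equal now, and the digests agree
assert np.array_equal(array1, array2)
assert array_digest(array1) == array_digest(array2)

# A tiny in-place change after the equality check is caught by re-hashing
array2[0, 0] += 1e-9
assert not np.array_equal(array1, array2)
assert array_digest(array1) != array_digest(array2)
```

Calling array_digest right before each predict() call would show whether the two scripts are truly scoring identical inputs.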

david-cortes commented 2 years ago

I am unable to find any irreproducibility. Here's an example - does it fail for you?

import numpy as np
from isotree import IsolationForest

rng = np.random.default_rng(seed=123)
n_rows = 20000
X = np.empty((n_rows, 2))
X[:,0] = rng.random(size=n_rows)
X[:,1] = (rng.random(size=n_rows) >= 0.5).astype(np.float64)

pred1 = IsolationForest(
    ndim=2,
    max_samples = int(n_rows/20),
    ntrees=500,
    ntry=1,
    random_seed=0,
    max_depth=12,
    missing_action="fail",
    coefs="normal",
    standardize_data=True,
    penalize_range=True,
    nthreads=2,
    bootstrap=False,
    prob_pick_pooled_gain=1
).fit(X).predict(X)

pred2 = IsolationForest(
    ndim=2,
    max_samples = int(n_rows/20),
    ntrees=500,
    ntry=1,
    random_seed=0,
    max_depth=12,
    missing_action="fail",
    coefs="normal",
    standardize_data=True,
    penalize_range=True,
    nthreads=2,
    bootstrap=False,
    prob_pick_pooled_gain=1
).fit(X).predict(X)

assert np.all(pred1 == pred2)

david-cortes commented 2 years ago

Yet another question: are you using the same computer and/or same virtual machine in both cases?

Harmadah commented 2 years ago

Hello, I tested everything again and found where the issue lay: it was in my preprocessing pipeline. Some operations were still performed on the arrays after I evaluated their equality, which led me to believe the model was responsible. I'm really sorry for the bother; I think you can close the issue, or even delete it, since your package was not at fault.
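For future readers hitting a similar symptom, the failure mode described here can be sketched without isotree at all: an equality check that passes, followed by a preprocessing step applied to only one of the arrays, will produce different scores from any deterministic model. The toy_score function below is a hypothetical stand-in for model.predict(), not part of the package:

```python
import numpy as np

def toy_score(X):
    # Stand-in for model.predict(): any deterministic function of the input
    return float(np.abs(X).max())

rng = np.random.default_rng(0)
array1 = rng.random((100, 2))
array2 = array1.copy()

# The equality check passes at this point...
assert np.array_equal(array1, array2)

# ...but a later preprocessing step touches only one array
array2 *= 1.02  # e.g. a stray rescaling in the second pipeline

# so the scores disagree even though the earlier check returned True
assert toy_score(array1) != toy_score(array2)
```

The fix is the same as in this issue: make sure no transformations run between the equality check and the scoring call, or re-check equality immediately before scoring.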

Nonetheless I'd like to thank you again for your time and for this excellent package which is really well documented. Best regards.