david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License
186 stars 38 forks source link

Regular Isolation Forest fitting very slow for large max_samples #17

Closed svenvanhal closed 3 years ago

svenvanhal commented 3 years ago

Hi David, let me first thank you for this excellent implementation of the Isolation Forest algorithm and variants.

I've discovered some strange behavior regarding the regular Isolation Forest fitting duration. Using a low value for max_samples, e.g. 256, fitting times are comparable if not faster than the scikit-learn implementation. However, increasing this number does not scale as one would expect.

For example, using a training set of 750,000 samples and a max_samples value of 256, fitting an IsoTree iForest takes 0.1s compared to 1.3s for the scikit-learn implementation. When increasing the number of max_samples to 65,536, the fitting time of the IsoTree model explodes to 119.7s, as opposed to only 2.6s for the scikit-learn implementation. The number of trees is fixed for all experiments and nthreads/n_jobs is set to -1.

Fitting an Extended Isolation Forest (ndim=2) in the exact same setting (750k training samples, max_samples=65536) for comparison takes only 2.6s.

Can you shed some light on this behavior?

I'm using the latest IsoTree version from pip (0.1.21)

Please find the benchmark code attached in this gist. My output is as follows:

# samples fit:     75,000
# samples predict: 25,000

max_samples: 256
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.2s
[IsoTree / IF]             0.0s          0.0s
[IsoTree / EIF]            0.0s          0.1s

max_samples: 2048
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.2s
[IsoTree / IF]             0.4s          0.0s
[IsoTree / EIF]            0.0s          0.1s

max_samples: 16384
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.2s
[IsoTree / IF]             9.6s          0.1s
[IsoTree / EIF]            0.4s          0.1s

max_samples: 65536
                       Fit time  Predict time
[Scikit-Learn / IF]        0.4s          0.2s
[IsoTree / IF]             6.3s          0.1s
[IsoTree / EIF]            1.9s          0.3s

---

# samples fit:     750,000
# samples predict: 250,000

max_samples: 256
                       Fit time  Predict time
[Scikit-Learn / IF]        1.3s          1.5s
[IsoTree / IF]             0.1s          0.3s
[IsoTree / EIF]            0.1s          0.6s

max_samples: 2048
                       Fit time  Predict time
[Scikit-Learn / IF]        1.3s          1.7s
[IsoTree / IF]             0.6s          0.6s
[IsoTree / EIF]            0.1s          1.1s

max_samples: 16384
                       Fit time  Predict time
[Scikit-Learn / IF]        2.1s          2.2s
[IsoTree / IF]            11.3s          0.6s
[IsoTree / EIF]            0.7s          1.7s

max_samples: 65536
                       Fit time  Predict time
[Scikit-Learn / IF]        2.6s          2.6s
[IsoTree / IF]           119.7s          0.6s
[IsoTree / EIF]            2.6s          2.1s
david-cortes commented 3 years ago

Thanks for the bug report, I'm able to reproduce it and will investigate what causes this slow down.

But for now, if you want to make it fast, you can use missing_action="fail". Here are the results on my machine with that option turned on:

# samples fit:     75,000
# samples predict: 25,000

max_samples: 256
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.1s
[IsoTree / IF]             0.0s          0.0s
[IsoTree / EIF]            0.0s          0.0s

max_samples: 2048
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.2s
[IsoTree / IF]             0.0s          0.0s
[IsoTree / EIF]            0.0s          0.0s

max_samples: 16384
                       Fit time  Predict time
[Scikit-Learn / IF]        0.3s          0.2s
[IsoTree / IF]             0.0s          0.0s
[IsoTree / EIF]            0.0s          0.0s

max_samples: 65536
                       Fit time  Predict time
[Scikit-Learn / IF]        0.3s          0.2s
[IsoTree / IF]             0.0s          0.0s
[IsoTree / EIF]            0.1s          0.0s

---

# samples fit:     750,000
# samples predict: 250,000

max_samples: 256
                       Fit time  Predict time
[Scikit-Learn / IF]        0.9s          1.3s
[IsoTree / IF]             0.0s          0.1s
[IsoTree / EIF]            0.1s          0.1s

max_samples: 2048
                       Fit time  Predict time
[Scikit-Learn / IF]        1.1s          1.6s
[IsoTree / IF]             0.1s          0.1s
[IsoTree / EIF]            0.1s          0.1s

max_samples: 16384
                       Fit time  Predict time
[Scikit-Learn / IF]        1.2s          1.8s
[IsoTree / IF]             0.1s          0.1s
[IsoTree / EIF]            0.1s          0.2s

max_samples: 65536
                       Fit time  Predict time
[Scikit-Learn / IF]        1.6s          2.1s
[IsoTree / IF]             0.2s          0.1s
[IsoTree / EIF]            0.4s          0.3s
svenvanhal commented 3 years ago

Thanks for your swift reply! Happy there's a workaround for now, works on my machine as well.

david-cortes commented 3 years ago

Fixed, but I now see that there will also be a similar issue if the number of columns is very large, and affects the single-variable and multi-variable models.