Closed svenvanhal closed 3 years ago
Thanks for the bug report, I'm able to reproduce it and will investigate what causes this slow down.
But for now, if you want to make it fast, you can use missing_action="fail"
. Here are the results on my machine with that option turned on:
# samples fit: 75,000
# samples predict: 25,000
max_samples: 256
Fit time Predict time
[Scikit-Learn / IF] 0.2s 0.1s
[IsoTree / IF] 0.0s 0.0s
[IsoTree / EIF] 0.0s 0.0s
max_samples: 2048
Fit time Predict time
[Scikit-Learn / IF] 0.2s 0.2s
[IsoTree / IF] 0.0s 0.0s
[IsoTree / EIF] 0.0s 0.0s
max_samples: 16384
Fit time Predict time
[Scikit-Learn / IF] 0.3s 0.2s
[IsoTree / IF] 0.0s 0.0s
[IsoTree / EIF] 0.0s 0.0s
max_samples: 65536
Fit time Predict time
[Scikit-Learn / IF] 0.3s 0.2s
[IsoTree / IF] 0.0s 0.0s
[IsoTree / EIF] 0.1s 0.0s
---
# samples fit: 750,000
# samples predict: 250,000
max_samples: 256
Fit time Predict time
[Scikit-Learn / IF] 0.9s 1.3s
[IsoTree / IF] 0.0s 0.1s
[IsoTree / EIF] 0.1s 0.1s
max_samples: 2048
Fit time Predict time
[Scikit-Learn / IF] 1.1s 1.6s
[IsoTree / IF] 0.1s 0.1s
[IsoTree / EIF] 0.1s 0.1s
max_samples: 16384
Fit time Predict time
[Scikit-Learn / IF] 1.2s 1.8s
[IsoTree / IF] 0.1s 0.1s
[IsoTree / EIF] 0.1s 0.2s
max_samples: 65536
Fit time Predict time
[Scikit-Learn / IF] 1.6s 2.1s
[IsoTree / IF] 0.2s 0.1s
[IsoTree / EIF] 0.4s 0.3s
Thanks for your swift reply! Happy there's a workaround for now, works on my machine as well.
Fixed, but I now see that there will also be a similar issue if the number of columns is very large, and affects the single-variable and multi-variable models.
Hi David, let me first thank you for this excellent implementation of the Isolation Forest algorithm and variants.
I've discovered some strange behavior regarding the regular Isolation Forest fitting duration. Using a low value for
max_samples
, e.g. 256, fitting times are comparable if not faster than the scikit-learn implementation. However, increasing this number does not scale as one would expect.For example, using a training set of 750,000 samples and a
max_samples
value of 256, fitting an IsoTree iForest takes 0.1s compared to 1.3s for the scikit-learn implementation. When increasing the number ofmax_samples
to 65,536, the fitting time of the IsoTree model explodes to 119.7s, as opposed to only 2.6s for the scikit-learn implementation. The number of trees is fixed for all experiments andnthreads
/n_jobs
is set to -1.Fitting an Extended Isolation Forest (
ndim=2
) in the exact same setting (750k training samples,max_samples=65536
) for comparison takes only 2.6s.Can you shed some light on this behavior?
I'm using the latest IsoTree version from pip (0.1.21)
Please find the benchmark code attached in this gist. My output is as follows: