david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License
197 stars 38 forks source link
anomaly-detection imputation isolation-forest outlier-detection

IsoTree

Fast and multi-threaded implementation of Isolation Forest (a.k.a. iForest) and variations of it such as Extended Isolation Forest (EIF), Split-Criterion iForest (SCiForest), Fair-Cut Forest (FCF), Robust Random-Cut Forest (RRCF), and other customizable variants, aimed at outlier/anomaly detection plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python, R, and C. An additional wrapper for Ruby can be found here.

The new concepts in this software are described in:


For a quick introduction to the Isolation Forest concept as used in this library, see:

Short Python example notebooks:

(R examples are available in the internal documentation)

Description

Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest model that aim at isolating outliers faster and finding clustered outliers.

Note that this is a black-box model that will not produce explanations or importances - for a different take on explainable outlier detection see OutlierTree.

image

(Code to produce these plots can be found in the R examples in the documentation)

Comparison against other libraries

The folder timings contains a speed comparison against other Isolation Forest implementations in Python (SciKit-Learn, EIF) and R (IsolationForest, isofor, solitude). From the benchmarks, IsoTree tends to be at least 1 order of magnitude faster than the libraries compared against in both single-threaded and multi-threaded mode.

Example timings for 100 trees and different sample sizes, CovType dataset - see the link above for full benchmark and details:

Library Model Time (s) 256 Time (s) 1024 Time (s) 10k
isotree orig 0.00161 0.00631 0.0848
isotree ext 0.00326 0.0123 0.168
eif orig 0.149 0.398 4.99
eif ext 0.16 0.428 5.06
h2o orig 9.33 11.21 14.23
h2o ext 1.06 2.07 17.31
scikit-learn orig 8.3 8.01 6.89
solitude orig 32.612 34.01 41.01

Example AUC as outlier detector in typical datasets (notebook to produce results here):

Library AUROC defaults AUROC grid search
isotree 0.70 0.84
eif - 0.714
scikit-learn 0.687 0.74
h2o 0.662 0.748
Library AUROC defaults AUROC grid search
isotree 0.80 0.982
eif - 0.808
scikit-learn 0.836 0.836
h2o 0.80 0.80

(Disclaimer: these are rather small datasets and thus these AUC estimates have high variance)

Non-random splits

While the original idea behind isolation forests consisted in deciding splits uniformly at random, it's possible to get better performance at detecting outliers in some datasets (particularly those with multimodal distributions) by determining splits according to an information gain criterion instead. The idea is described in "Revisiting randomized choices in isolation forests" along with some comparisons of different split guiding criteria.

Different outlier scoring criteria

Although the intuition behind the algorithm was to look at the tree depth required for isolation, this package can also produce outlier scores based on density criteria, which provide improved results in some datasets, particularly when splitting on categorical features. The idea is described in "Isolation forests: looking beyond tree depth".

Distance / similarity calculations

General idea was extended to produce distance (alternatively, similarity) between observations according to how many random splits it takes to separate them - idea is described in "Distance approximation using Isolation Forests".

Imputation of missing values

The model can also be used to impute missing values in a similar fashion as kNN, by taking the values from observations in the terminal nodes of each tree in which an observation with missing values falls at prediction time, combining the non-missing values of the other observations as a weighted average according to the depth of the node and the number of observations that fall there. This is not related to how the model handles missing values internally, but is rather meant as a faster way of imputing by similarity. Quality is usually not as good as chained equations, but the method is a lot faster and more scalable. Recommended to use non-random splits when used as an imputer. Details are described in "Imputing missing values with unsupervised random trees".

Highlights

There's already many available implementations of isolation forests for both Python and R (such as the one from the original paper's authors' or the one in SciKit-Learn), but at the time of writing, all of them are lacking some important functionality and/or offer sub-optimal speed. This particular implementation offers the following:

(Note that categoricals, NAs, and density-like sample weights, are treated heuristically with different options as there is no single logical extension of the original idea to them, and having them present might degrade performance/accuracy for regular numerical non-missing observations)

Installation

Note: This package benefits from extra optimizations that aren't enabled by default for R packages. See this guide for instructions on how to enable them.

install.packages("isotree")

Note: requires C/C++ compilers configured for Python. See this guide for instructions.

pip install isotree

or if that fails:

pip install --no-use-pep517 isotree

Note for macOS users: on macOS, the Python version of this package might compile without multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:

brew install libomp

And then reinstall this package: pip install --upgrade --no-deps --force-reinstall isotree.


IMPORTANT: the setup script will try to add compilation flag -march=native. This instructs the compiler to tune the package for the CPU in which it is being installed (by e.g. using AVX instructions if available), but the result might not be usable in other computers. If building a binary wheel of this package or putting it into a docker image which will be used in different machines, this can be overriden either by (a) defining an environment variable DONT_SET_MARCH=1, or by (b) manually supplying compilation CFLAGS as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:

export DONT_SET_MARCH=1
pip install isotree

or, by specifying some compilation flag for architecture:

export CFLAGS="-march=x86-64"
export CXXFLAGS="-march=x86-64"
pip install isotree

for a system-wide install in linux

sudo make install sudo ldconfig


(Will build as a shared object - linkage is then done with `-lisotree`)

Be aware that the snippet above includes option `-DUSE_MARCH_NATIVE=1`, which will make it use the highest-available CPU instruction set (e.g. AVX2) and will produces objects that might not run on older CPUs - to build more "portable" objects, remove this option from the cmake command.

The package has an optional dependency on the [Robin-Map](https://github.com/Tessil/robin-map) library, which is added to this repository as a linked submodule. If this library is not found under `/src`, will use the compiler's own hashmaps, which are less optimal.

* Ruby:

See [external repository with wrapper](https://github.com/ankane/isotree).

# Sample usage

**Warning: default parameters in this implementation are very different from default parameters in others such as Scikit-Learn's, and these defaults won't scale to large datasets (see documentation for details).**

* Python:

(Library is Scikit-Learn compatible)

```python
import numpy as np
from isotree import IsolationForest

### Random data from a standard normal distribution
np.random.seed(1)
n = 100
m = 2
X = np.random.normal(size = (n, m))

### Will now add obvious outlier point (3, 3) to the data
X = np.r_[X, np.array([3, 3]).reshape((1, m))]

### Fit a small isolation forest model
iso = IsolationForest(ntrees = 10, nthreads = 1)
iso.fit(X)

### Check which row has the highest outlier score
pred = iso.predict(X)
print("Point with highest outlier score: ",
      X[np.argsort(-pred)[0], ])

(see documentation for more examples - help(isotree::isolation.forest))

### Random data from a standard normal distribution
library(isotree)
set.seed(1)
n <- 100
m <- 2
X <- matrix(rnorm(n * m), nrow = n)

### Will now add obvious outlier point (3, 3) to the data
X <- rbind(X, c(3, 3))

### Fit a small isolation forest model
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)

### Check which row has the highest outlier score
pred <- predict(iso, X)
cat("Point with highest outlier score: ",
    X[which.max(pred), ], "\n")

The package comes with two different C++ interfaces: (a) a struct-based interface which exposes the full library's functionalities but makes little checks on the inputs it receives and is difficult to use due to the large number of arguments that functions require; and (b) a scikit-learn-like interface in which the model exposes a single class with methods like 'fit' and 'predict', which is less flexible than the struct-based interface but easier to use and the function signatures disallow some potential errors due to invalid parameter combinations. The latter ((b)) is recommended to use unless some specific functionality from (a) is required.

See files: isotree_cpp_oop_ex.cpp for an example with the scikit-learn-like interface (recommended); and isotree_cpp_ex.cpp for an example with the struct-based interface.

Note that the second interface does not expose all the functionalities - for example, it only supports inputs of classes 'double' and 'int', while the struct-based interface also supports 'float'/'size_t'.

See file isotree_c_ex.c.

Note that the C interface is a simple wrapper over a subset of the scikit-learn-like C++ interface, but using only ISO C bindings for better compatibility and easier wrapping in other languages.

See external repository with wrapper.

Examples

Documentation

Reducing library size and compilation times

By default, this library will compile with some functionalities that are unlikely to be used and which can significantly increase the size of the library and compilation times - if using this library in e.g. embedded devices, it is highly recommended to disable some options, and if creating a docker images for serving models, one might want to make it as minimal as possible. Being a C++ templated library, it generates multiple versions of its functions that are specialized for different types (such as C double and float), and in practice not all the supported types are likely to be used.

In particular, the library supports usage of long double type for more precise aggregated calculations (e.g. standard deviations), which is unlikely to end up used (its usage is determined by a user-passed function argument and not available in the C or C++-OOP interfaces). For a smaller library and faster compilation, support for long double can be disabled by:

Additionally, the library will produce functions for different floating point and integer types of the input data. In practice, one usually ends up using only double and int types (these are the only types supported in the R interface and in the C and C++-OOP interfaces). When building it as a shared library through the CMake system, these can be disabled (leaving only double and int support) through option NO_TEMPLATED_VERSIONS - e.g.:

cmake -DNO_TEMPLATED_VERSIONS=1 ..

(this option is not available for the Python build system)

References