PasaLab / forestlayer

ForestLayer: Efficient and scalable deep forest learning library based on Ray
Apache License 2.0
10 stars 4 forks source link

ForestLayer

ForestLayer is an highly efficient and scalable deep forest learning library based on Ray. It provides rich easy-to-use data processing, model training modules to help researchers and engineers build practical deep forest learning workflows. It internally embeds task parallelization mechanism using Ray, which is a popular flexible, high-performance distributed execution framework proposed by U.C.Berkeley.
ForestLayer aims to enable faster experimentation as possible and reduce the delay from idea to result.

Hope that ForestLayer can bring you good researches and good products.

You can refer to Deep Forest Paper, Ray Project to find more details.

News

Prerequisites

numpy, scikit-learn, keras, ray, joblib, xgboost, psutil, matplotlib, pandas

Installation

ForestLayer has install prerequisites including scikit-learn, keras, numpy, ray and joblib. For GPU support, CUDA and cuDNN are required, but now we have not support GPU yet. The simplest way to install ForestLayer in your python program is:

[for master version] pip install git+https://github.com/whatbeg/forestlayer.git
[for stable version] pip install forestlayer

Alternatively, you can install ForestLayer from the github source:

$ git clone https://github.com/whatbeg/forestlayer.git

$ cd forestlayer
$ python setup.py install

What is deep forest?

The architecture of deep forest. from [1]

[1] Deep Forest: Towards An Alternative to Deep Neural Networks, by Zhi-Hua Zhou and Ji Feng. IJCAI 2017

Getting Started: 30 seconds to ForestLayer

The core data structure of ForestLayer is layers and graph. Layers are basic modules to implement different data processing, and the graph is like a model that organize layers, the basic type of graph is a stacking of layers, and now we only support this type of graph.

Take MNIST classification task as an example.

First, we use the Keras API to load mnist data and do some pre-processing.

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# TODO: preprocessing...

Next, we construct multi-grain scan windows and estimators every window and then initialize a MultiGrainScanLayer. The Window class is lies in forestlayer.layers.window package and the estimators are represented as EstimatorArguments, which will be used later in layers to create actual estimator object.

rf1 = ExtraRandomForestConfig(min_samples_leaf=10, max_features='sqrt')
rf2 = RandomForestConfig(min_samples_leaf=10)

windows = [Window(win_x=7, win_y=7, stride_x=2, stride_y=2, pad_x=0, pad_y=0),
           Window(11, 11, 2, 2)]

est_for_windows = [[rf1, rf2], [rf1, rf2]]

mgs = MultiGrainScanLayer(windows=windows, est_for_windows=est_for_windows, n_class=10)

After multi-grain scan, we consider that building a pooling layer to reduce the dimension of generated feature vectors, so that reduce the computation and storage complexity and risk of overfiting.

pools = [[MaxPooling(2, 2), MaxPooling(2, 2)],
         [MaxPooling(2, 2), MaxPooling(2, 2)]]

pool = PoolingLayer(pools=pools)

And then we add a concat layer to concatenate the output of estimators of the same window.

concatlayer = ConcatLayer()

Then, we construct the cascade part of the model, we use an auto-growing cascade layer to build our deep forest model.

est_configs = [
    ExtraRandomForestConfig(),
    ExtraRandomForestConfig(),
    RandomForestConfig(),
    RandomForestConfig()
]

auto_cascade = AutoGrowingCascadeLayer(est_configs=est_configs, early_stopping_rounds=4,
                                       stop_by_test=True, n_classes=10, distribute=False)

Last, we construct a graph to stack these layers to make them as a complete model.

model = Graph()
model.add(mgs)
model.add(pool)
model.add(concatlayer)
model.add(auto_cascade)

You also can call model.summary() like Keras to see the appearance of the model. Summary info could be like follows table,

____________________________________________________________________________________________________
Layer                    Description                                                        Param #
====================================================================================================
MultiGrainScanLayer      [win/7x7, win/11x11]                                               params
                         [[FLCRF, FLRF][FLCRF, FLRF]]
____________________________________________________________________________________________________
PoolingLayer             [[maxpool/2x2, maxpool/2x2][maxpool/2x2, maxpool/2x2]]             params
____________________________________________________________________________________________________
ConcatLayer              ConcatLayer(axis=-1)                                               params
____________________________________________________________________________________________________
AutoGrowingCascadeLayer  maxlayer=0, esrounds=3                                             params
                         Each Level:
                         [FLCRF, FLCRF, FLRF, FLRF]
====================================================================================================

After building the model, you can fit the model, and then evaluate or predict using the fit model. Because the model is often very large (dozens of trained forests, every of them occupies hundreds of megabytes), so we highly recommend users to use fit_transform to train and test data, thus we can make keep_in_mem=False to avoid the cost of caching model in memory or disk.

# 1
model.fit(x_train, y_train)
model.evaluate(x_test, y_test)
result = model.predict(x_in)

# 2 (recommend)
model.fit_transform(x_train, y_train, x_test, y_test)

For more examples and tutorials, you can refer to examples to find more details.

Examples

See examples.

etc.

Contributions

Welcome contributions, you can contact huqiu00 at 163.com.

License

Please contact the authors for the licence info of the source code.