dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

.partial_fit() for sklearn forest ensembles – useful? #452

Open garethjns opened 5 years ago

garethjns commented 5 years ago

I’ve been playing with dask for a while and as a incremental model fitting learning exercise, have made some extensions to the sklearn forest ensembles. Basically, it’s the addition of a .partial_fit() method to RandomForestClassifer and ExtraTreesClassifier to make them work with dask-ml’s Incremental wrapper.

To be clear, this isn’t a decision tree that supports incremental fitting (which would be a considerably more complex endeavour); these are forests that allow training individual trees, or small groups of trees, on different chunks of rows, rather than on all (bootstrapped or not) rows in the training set.
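The core idea can be sketched with sklearn's warm_start mechanism: increasing n_estimators on a warm-started forest and calling fit trains only the new trees, on whatever data is passed. This is a hypothetical minimal sketch, not the actual StreamingRFC implementation (the class name and n_estimators_per_chunk parameter here just mirror the example below):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

class PartialFitRFC(RandomForestClassifier):
    """Hypothetical sketch: grow a forest chunk-by-chunk via warm_start."""

    def __init__(self, n_estimators_per_chunk=10, **kwargs):
        # warm_start=True makes each .fit() call train only the *new* trees
        # implied by an increased n_estimators, keeping the existing ones.
        super().__init__(n_estimators=0, warm_start=True, **kwargs)
        self.n_estimators_per_chunk = n_estimators_per_chunk

    def partial_fit(self, X, y, classes=None):
        # A robust implementation must handle labels missing from a chunk;
        # this sketch assumes every chunk contains all classes.
        self.n_estimators += self.n_estimators_per_chunk
        return self.fit(X, y)

Each partial_fit call therefore adds a fixed number of trees fitted only on the current chunk, which is what lets the Incremental wrapper stream chunks through it.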

I haven’t tested performance extensively, but from what I’ve seen it’s generally at least as good as the standard models after seeing an equivalent number of rows of data. It’s potentially better, as the model sees a greater variety of data than the alternative of sampling a limited training set down to fit in memory. These forests also perform better than SGD (am I correct in thinking that only some of the linear sklearn models currently implement partial fit?)
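For reference, sklearn’s existing out-of-core classifiers are indeed mostly linear models (plus naive Bayes variants); SGDClassifier is the usual example, and its partial_fit has the same classes= requirement that Incremental.fit passes through below. A small sketch on synthetic data:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
for chunk in range(3):
    sl = slice(chunk * 100, (chunk + 1) * 100)
    # classes must be given up front, since no single chunk is
    # guaranteed to contain every label.
    clf.partial_fit(X[sl], y[sl], classes=np.array([0, 1]))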

The main modifications to the sklearn classes are:

If there aren’t currently other non-linear models available for incremental fitting, and this approach is worth pursuing, I’d be happy to continue work on it and contribute it to dask-ml. I think it would be relatively simple to extend the current implementations to work with the Regressors, and possibly the sklearn GradientBoosting ensembles as well.

The code is here: https://github.com/garethjns/IncrementalTrees, and here’s a usage example with dask. There are also a couple more in the readme showing different operating modes.

import numpy as np
import dask_ml.datasets
from dask_ml.wrappers import Incremental
from dask.distributed import Client, LocalCluster
from dask import delayed
from incremental_trees.trees import StreamingRFC

# Generate some data out-of-core
x, y = dask_ml.datasets.make_blobs(n_samples=200_000, chunks=10_000,
                                   random_state=0, n_features=40,
                                   centers=2, cluster_std=100)

# Create a throwaway cluster and client to run on
with LocalCluster(processes=False,
                  n_workers=2,
                  threads_per_worker=2) as cluster, Client(cluster) as client:

    # Wrap model with Dask Incremental
    srfc = Incremental(StreamingRFC(n_estimators_per_chunk=10,
                                    max_n_estimators=np.inf,
                                    n_jobs=4))

    # Call fit directly, specifying the expected classes
    srfc.fit(x, y,
             classes=delayed(np.unique(y)).compute())

    print(len(srfc.estimators_))
    print(srfc.score(x, y))
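As an aside on the classes= line above: since y is a dask array, the unique labels can also be computed natively with dask.array rather than via delayed. A sketch, assuming dask.array is available:

import numpy as np
import dask.array as da

# y as a chunked dask array of labels
y = da.from_array(np.array([0, 1, 1, 0, 1] * 20), chunks=10)

# da.unique finds the distinct labels across all chunks lazily;
# .compute() materialises them for Incremental.fit(classes=...)
classes = da.unique(y).compute()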

TomAugspurger commented 5 years ago

Thanks for this. I'll take a closer look again later.

This just came up on the scikit-learn mailing list, if you want to chime in there: https://mail.python.org/pipermail/scikit-learn/2019-March/003050.html

garethjns commented 5 years ago

Hi Tom,

Thanks for the heads up, I hadn’t seen the mailing list.