dask / dask-searchcv

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml
BSD 3-Clause "New" or "Revised" License

More optimizations #38

Closed jcrist closed 7 years ago

jcrist commented 7 years ago

This adds a few optimizations to reduce the overhead of graph building and serialization:

Instead of sending parameters as a list of dicts that all share the same keys, we send the keys once along with a list of value tuples. This reduces the serialized size of the parameters by ~50%.
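A minimal sketch of what that encoding change buys (illustrative only, not dask-searchcv's actual internals): serialize the same parameter grid both ways and compare pickle sizes.

```python
import pickle
from itertools import product

# A small parameter grid, expanded into all candidate combinations.
grid = {'max_depth': range(1, 101), 'min_samples_leaf': range(1, 6)}
keys = sorted(grid)
combos = list(product(*(grid[k] for k in keys)))

# Old encoding: one dict per candidate, keys repeated for every candidate.
as_dicts = [dict(zip(keys, vals)) for vals in combos]
# New encoding: keys sent once, values as a list of tuples.
as_tuples = (keys, combos)

old = len(pickle.dumps(as_dicts, 4))
new = len(pickle.dumps(as_tuples, 4))
print("dicts: %d bytes, keys+tuples: %d bytes (%.0f%% smaller)"
      % (old, new, 100 * (1 - new / old)))
```

The exact savings depend on the grid shape (pickle memoizes repeated key strings, so the per-dict overhead is framing plus memo references), but the tuple form is always smaller.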

Previously every cv_split + param combo had at least 2 tasks: one to fit and one to score. For the common case of non-pipeline estimators, we can combine these into a single task, reducing the graph size by 50%.
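The merged task looks roughly like this (a hypothetical sketch; `fit_and_score` and the graph keys are illustrative names, not dask-searchcv's actual API). One graph node per (cv split, parameter) pair does both the fit and the score, and only the scalar score travels back:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_and_score(est, params, X_train, y_train, X_test, y_test):
    """Fit a fresh clone with the given params and return its test score."""
    model = clone(est).set_params(**params)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

X, y = make_classification(100, n_features=5, random_state=0)

# One task instead of a fit task plus a score task:
dsk = {('fit-score', 0): (fit_and_score, DecisionTreeClassifier(),
                          {'max_depth': 3}, X[:80], y[:80], X[80:], y[80:])}

# Execute the task the way a dask scheduler would: call func on its args.
func, *args = dsk[('fit-score', 0)]
print("score:", func(*args))
```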

Using the following benchmark:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from dklearn import DaskGridSearchCV
from sklearn.datasets import make_classification
from timeit import default_timer
import pickle

# Make data small so nbytes cost is negligible
X, y = make_classification(2, n_features=5)

model = DecisionTreeClassifier()
# 500,000 candidate parameter combinations (1000 * 100 * 5)
grid = {'max_depth': np.arange(1, 1001),      # 1000
        'random_state': np.arange(100),       # 100
        'min_samples_leaf': np.arange(1, 6)}  # 5

grid_search = DaskGridSearchCV(model, grid)
start = default_timer()
grid_search.fit(X, y)
stop = default_timer()
print("Graph building took %.3f seconds" % (stop - start))
print("Graph has %d tasks" % len(grid_search.dask_graph_))
start = default_timer()
nbytes = sum(len(pickle.dumps(v, 4)) for v in grid_search.dask_graph_.values())
stop = default_timer()
print("Serialized graph takes %.3f GB" % (nbytes / 1e9))
print("Serializing graph took %.3f seconds" % (stop - start))

Master:

(dask) jcrist grid-search $ python dklearn_grid_search_script.py
Graph building took 14.465 seconds
Graph has 3000008 tasks
Serialized graph takes 1.403 GB
Serializing graph took 42.132 seconds

This PR:

(dask) jcrist grid-search $ python dklearn_grid_search_script.py
Graph building took 10.092 seconds
Graph has 1500009 tasks
Serialized graph takes 0.761 GB
Serializing graph took 31.358 seconds
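The task counts line up with the grid size, assuming the default 3-fold CV of the time: 1000 × 100 × 5 = 500,000 candidates times 3 splits gives 1.5M fit/score pairs, at two tasks each on master versus one merged task on this PR, plus a handful of setup tasks in both cases.

```python
candidates = 1000 * 100 * 5  # max_depth x random_state x min_samples_leaf
cv_splits = 3                # assumed default 3-fold CV
pairs = candidates * cv_splits

print(pairs * 2)  # master: separate fit + score tasks -> 3,000,000
print(pairs)      # this PR: one merged task per pair  -> 1,500,000
```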
jcrist commented 7 years ago

cc @pinjutien, @yzard, @bansalr, this should provide some more speedups on top of #37.

yzard commented 7 years ago

@jcrist Salute! Thanks for your effort, we really appreciate it. We'll now run this on machines with several hundred cores to see the results!

jakirkham commented 7 years ago

What size data (order of magnitude) are you pushing through this @yzard?

yzard commented 7 years ago

@jakirkham a few hundred megabytes.