jcrist closed this 7 years ago
cc @pinjutien, @yzard, @bansalr, this should provide some more speedups on top of #37.
@jcrist Salute! Thanks for your effort, we really appreciate it. We will now run this on machines with several hundred cores to see the results!
What size data (order of magnitude) are you pushing through this @yzard?
@jakirkham a few hundred megabytes.
This adds a few optimizations to reduce the overhead of graph building and serialization:
- Instead of sending parameters as a list of dicts, each with the same keys, we send the keys once along with a list of value tuples. This reduces the serialized size of the parameters by ~50%.
- Previously, every combination of CV split and parameter set had at least 2 tasks: one to fit and one to score. For the common case of non-pipelines, we can combine these into a single task, reducing the graph size by 50%.
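The two optimizations above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from this PR: `pack_params`, `unpack_params`, `fit`, `score`, `fit_and_score`, and both graph builders are hypothetical names, the estimator is replaced by a placeholder, and the graphs are ordinary dicts in the `{key: (function, *args)}` style rather than anything scheduled by dask. The size comparison uses `pickle`, so the exact savings will differ from those seen with the real serialization path.

```python
import pickle


def pack_params(param_dicts):
    """Split same-keyed dicts into one tuple of keys and a list of value tuples."""
    keys = tuple(sorted(param_dicts[0]))
    values = [tuple(d[k] for k in keys) for d in param_dicts]
    return keys, values


def unpack_params(keys, values):
    """Rebuild the original list of dicts on the receiving side."""
    return [dict(zip(keys, v)) for v in values]


# Hypothetical parameter grid: 16 candidates sharing the same three keys.
candidates = [
    {"C": c, "gamma": g, "kernel": "rbf"}
    for c in (0.1, 1.0, 10.0, 100.0)
    for g in (0.001, 0.01, 0.1, 1.0)
]

keys, values = pack_params(candidates)
size_dicts = len(pickle.dumps(candidates))
size_packed = len(pickle.dumps((keys, values)))
print("list of dicts:", size_dicts, "bytes; keys + tuples:", size_packed, "bytes")


# Placeholder stand-ins for fitting and scoring an estimator (assumptions).
def fit(params, split):
    return ("model", params, split)


def score(model, split):
    return float(len(model[1]))


def fit_and_score(params, split):
    """Fused task: fit and score in one call, with no intermediate graph key."""
    return score(fit(params, split), split)


def two_task_graph(param_list, n_splits):
    """One fit task and one score task per (parameter set, split) pair."""
    graph = {}
    for i, params in enumerate(param_list):
        for s in range(n_splits):
            graph[("fit", i, s)] = (fit, params, s)
            graph[("score", i, s)] = (score, ("fit", i, s), s)
    return graph


def fused_graph(param_list, n_splits):
    """A single fit-and-score task per (parameter set, split) pair."""
    graph = {}
    for i, params in enumerate(param_list):
        for s in range(n_splits):
            graph[("fit-score", i, s)] = (fit_and_score, params, s)
    return graph


separate = two_task_graph(candidates, n_splits=3)
fused = fused_graph(candidates, n_splits=3)
print("tasks with separate fit/score:", len(separate), "fused:", len(fused))
```

In this toy setting the fused graph has exactly half as many tasks because it contains nothing but fit and score work; a real graph also carries tasks for splitting data and collecting results, so the overall reduction depends on how much of the graph those make up.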
Using the following benchmark:
Master:
This PR: