Problem
At the moment all predictions are done sequentially. When predicting a few thousand samples, prediction is slower than training the model itself.
Idea
The computation time can be decreased by a few orders of magnitude by predicting all samples that reach a node at once, and then sorting the predictions back into the order of the given X.
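The idea can be sketched as follows. This is a minimal, hypothetical illustration, not the actual LCE internals: `MeanPredictor`, `assign_node`, and `predict_batched` are all made-up names. Samples are grouped by the node they fall into, each group is predicted in one vectorized call, and boolean-mask assignment scatters the results straight back into the original order of X.

```python
import numpy as np

class MeanPredictor:
    """Toy per-node model: always predicts one constant value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

def assign_node(X):
    """Hypothetical routing: sign of the first feature picks the node."""
    return (X[:, 0] >= 0).astype(int)

def predict_batched(nodes, X):
    """Predict all samples at once by grouping them per node and
    scattering the results back into the original order of X."""
    node_ids = assign_node(X)
    y = np.empty(len(X))
    for node_id in np.unique(node_ids):
        mask = node_ids == node_id                  # samples reaching this node
        y[mask] = nodes[node_id].predict(X[mask])   # one vectorized call
    return y                                        # same order as X

nodes = {0: MeanPredictor(-1.0), 1: MeanPredictor(1.0)}
X = np.array([[0.5], [-2.0], [3.0], [-0.1]])
print(predict_batched(nodes, X))  # [ 1. -1.  1. -1.]
```

Instead of one model call per sample, this does one call per node, which is where the order-of-magnitude gain comes from when many samples share a node.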
Speed comparison
I ran my patch with two different datasets and used cProfile to measure the improvement. In both cases the computation time was improved by two orders of magnitude.
dataset: dmc_2003
X_train shape: (8000, 50)
n classes in y: 2
11177 samples to predict
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 240.757 240.757 *_lce.py:430(predict) <- current version
1 0.000 0.000 118.697 118.697 *_lce.py:393(fit)
1 0.000 0.000 0.885 0.885 *_lce.py:430(predict) <- patch version
dataset: dmc_2007
X_train shape: (10000, 14)
n classes in y: 3
50000 samples to predict
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1070.539 1070.539 *_lce.py:430(predict) <- current version
1 0.000 0.000 138.662 138.662 *_lce.py:393(fit)
1 0.000 0.000 5.714 5.714 *_lce.py:430(predict) <- patch version
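Timings of this shape can be collected with `cProfile` and `pstats` from the standard library. The sketch below uses a placeholder function (`predict_all` is a stand-in for a `model.predict(X)` call, not anything from LCE) but produces the same columns as the dumps above:

```python
import cProfile
import pstats

def predict_all(n):
    # placeholder for model.predict(X); any callable can be profiled this way
    return [i * i for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
predict_all(100_000)
profiler.disable()

# report the same columns as above: ncalls, tottime, percall, cumtime, ...
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```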
Validation
First I checked that I got the same classification_report. But this can be misleading when running on small toy datasets, so I also looked at the proba values that are returned to the bagging classifier and checked that they are the same.
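The proba comparison can be done directly on the returned arrays, e.g. with `np.allclose`. A generic sketch, where `old_proba` and `new_proba` are stand-ins for the outputs of the current and the patched `predict_proba`:

```python
import numpy as np

# stand-ins for the outputs of the current and the patched predict_proba
old_proba = np.array([[0.7, 0.3], [0.1, 0.9]])
new_proba = np.array([[0.7, 0.3], [0.1, 0.9]])

# element-wise equality up to floating-point tolerance
assert old_proba.shape == new_proba.shape
assert np.allclose(old_proba, new_proba)
print("probabilities match")
```

Comparing the raw probabilities is stricter than comparing a classification_report, since two different probability vectors can still yield the same predicted class.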
Thank you for your suggestion and the analysis.
The new version 0.2.6 contains the speedup improvement inspired by your proposal, extended to the case of one-dimensional input and to LCERegressor.
Pull request
I made a pull request: 0cd62d6