tupini07 closed this issue 4 years ago
A preliminary search of different ensemble methods shows that the following are the most commonly used:
Bagging: Train multiple classifiers on the whole dataset (or on bootstrap samples) and then combine their predictions, either by majority vote or by averaging. This is exactly what was implemented in #305. This functionality is also provided by sklearn.ensemble.VotingClassifier
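A minimal sketch of the VotingClassifier approach, using toy data and arbitrarily chosen base classifiers (not the ones RecordLinkage would actually use):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# "hard" combines by majority vote; "soft" averages predicted probabilities
clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
clf.fit(X, y)
```

Swapping `voting="hard"` for `voting="soft"` switches between the two combination strategies described above.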
Boosting: Add models to the ensemble in sequence, weighting the training data for each new model towards the points the current ensemble mispredicts.
Stacking: This extends the idea of gating. In stacking we train a series of models, where each model (except the first) is trained on the outputs of the previous one, like sequential layers of models. A common difficulty with stacking is finding the optimal combination/structure of models.
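sklearn's StackingClassifier covers the common single-layer case of this idea: base models feed their predictions into a final meta-model. A sketch on toy data (deeper, multi-layer stacks would need nesting or a dedicated library):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# The final_estimator is trained on the base models' cross-validated
# predictions, which is the "next layer trained on previous outputs" idea
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
```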
An interesting library to consider for this part is ML-Ensemble, which provides an sklearn-like API for its models, so it should be fairly easy to integrate with RecordLinkage.
We already have bagging implemented by the VotingClassifier. We could implement Gating, which is the natural extension, and then try Stacking to see whether performance changes.
Consider the different ensemble methods presented in the literature. Choose some of them to use in this project.