tupini07 closed this issue 4 years ago
A preliminary search of different ensemble methods shows that the following are the most commonly used:
Bagging: Train multiple classifiers on the whole dataset (or on bootstrap samples) and then combine their predictions, either by majority vote or by averaging. This is exactly what was implemented in #305. This functionality is also provided by sklearn.ensemble.VotingClassifier
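A minimal sketch of the VotingClassifier approach, using toy data and arbitrarily chosen base classifiers (not the ones RecordLinkage would actually use):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# "hard" combines by majority vote; "soft" averages predicted probabilities
clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
clf.fit(X, y)
```

Swapping `voting="hard"` for `voting="soft"` switches between the two combination strategies described above.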
Boosting: Add models to the ensemble in sequence, weighting the training data for each new model towards the points the current ensemble mispredicts.
Stacking: This extends the idea of gating. In stacking we train a series of models, where each model (except the first) is trained on the outputs of the previous one, like sequential layers of models. A common difficulty with stacking is finding the optimal combination/structure of models.
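sklearn's StackingClassifier covers the common single-layer case of this idea: base models feed their predictions into a final meta-model. A sketch on toy data (deeper, multi-layer stacks would need nesting or a dedicated library):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# The final_estimator is trained on the base models' cross-validated
# predictions, which is the "next layer trained on previous outputs" idea
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
```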
An interesting library to consider for this part is ML-Ensemble, which provides an sklearn-like API for its models, so it should be fairly easy to integrate with RecordLinkage.
We already have bagging implemented by the VotingClassifier. We could implement Gating, which is the natural extension, and then try Stacking to see whether performance changes.
Consider the different ensemble methods presented in the literature. Choose some of them to use in this project.