Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Check if sklearn's VotingClassifier is better than our 'super-confident' implementation #363

Closed tupini07 closed 5 years ago

tupini07 commented 5 years ago

After reading the documentation for sklearn.ensemble.VotingClassifier, it seems to do exactly what our implementation of super-confident predictions (#305) does. The only difference is that it always joins the sets of predictions by union.

This task is to evaluate the performance of sklearn's VotingClassifier. If it performs as well as or better than our current implementation, it would be a good idea to replace ours with it: that would reduce the amount of code in the project, and we would no longer need to maintain this functionality.
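To make the "joined by union" point concrete, here is a rough sketch of what joining per-classifier predictions and averaging their scores could look like. It is not soweego's actual code: the function name `join_and_average`, the dict-of-scores input format, and the choice to average only the scores that are present are all assumptions made for illustration.

```python
from statistics import mean


def join_and_average(predictions, join="union"):
    """Join per-classifier predictions, then average the available scores.

    `predictions` is a list with one dict per classifier, each mapping a
    candidate pair to a confidence score. Hypothetical format, for
    illustration only.
    """
    key_sets = [set(p) for p in predictions]
    if join == "union":
        # Keep a pair if at least one classifier predicted it
        pairs = set.union(*key_sets)
    elif join == "intersection":
        # Keep a pair only if every classifier predicted it
        pairs = set.intersection(*key_sets)
    else:
        raise ValueError(f"unknown join: {join!r}")
    # Assumption: a classifier that did not predict a pair is simply skipped
    return {
        pair: mean(p[pair] for p in predictions if pair in p)
        for pair in pairs
    }


# Made-up scores from two classifiers
clf_a = {("Q1", "d1"): 0.9, ("Q2", "d2"): 0.8}
clf_b = {("Q1", "d1"): 0.7, ("Q3", "d3"): 0.6}

by_union = join_and_average([clf_a, clf_b], join="union")
by_intersection = join_and_average([clf_a, clf_b], join="intersection")
print(sorted(by_union))
print(sorted(by_intersection))
```

With union, all three pairs survive; with intersection, only the pair both classifiers agree on does. VotingClassifier has no equivalent of the intersection mode, since every estimator scores every sample.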

tupini07 commented 5 years ago

After some testing it seems that:

  1. When used with soft voting (voting='soft'), sklearn's VotingClassifier can only be composed of classifiers that have a predict_proba method (meaning we can't use LSVM)
  2. The performance of VotingClassifier is very similar to that of our union average method (which does use LSVM). It is actually slightly better, with a similar standard deviation (see table below)
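As a minimal sketch of the first point, the snippet below builds a soft-voting ensemble and checks that LinearSVC has no predict_proba. The synthetic data and the particular pool of estimators are arbitrary choices for illustration, not the classifiers soweego actually uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic binary classification data, only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# LinearSVC exposes decision_function but no predict_proba,
# so it cannot take part in soft voting
lsvm_has_proba = hasattr(LinearSVC(), "predict_proba")
print("LinearSVC has predict_proba:", lsvm_has_proba)

# Illustrative pool: every member implements predict_proba
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X, y)

# One row per sample, one column per class
proba = ensemble.predict_proba(X[:5])
print(proba.shape)
```

With voting='hard' the ensemble takes a majority vote over predicted labels instead, which does not need predict_proba but also does not produce averaged confidence scores.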

However, there seems to be an issue with our method when LSVM is removed from the pool of classifiers: performance drops sharply across all metrics, as can be seen below. Rather than debugging why this happens, we'll just go ahead and use sklearn's implementation. And since we won't be using LSVM for this ensemble, we'll add a new classifier to the pool to maintain diversity (issue #365).

| Method | Precision (std) | Recall (std) | F1 (std) |
|---|---|---|---|
| Union Average | .895 (.003) | .967 (.001) | .930 (.001) |
| Union Average (no LSVM) | .303 (.010) | .515 (.012) | .381 (.010) |
| VotingClassifier | .907 (.007) | .972 (.006) | .938 (.001) |

NOTE: these metrics were obtained by running the evaluate procedure with discogs/musician as the target.
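For readers who want to produce numbers in the same mean (std) shape, here is one possible way to do it with plain scikit-learn cross-validation. This is not soweego's evaluate procedure: the synthetic data, the 5-fold setup, and the two-estimator pool are assumptions chosen only to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)

metrics = ["precision", "recall", "f1"]
# Assumption: 5-fold cross-validation; the fold count is arbitrary here
scores = cross_validate(ensemble, X, y, cv=5, scoring=metrics)

# Report each metric as mean (std) over the folds
for metric in metrics:
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.3f} ({values.std():.3f})")
```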