Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Add Decision Trees as a classifier #355

Closed tupini07 closed 4 years ago

tupini07 commented 4 years ago

Decision trees, and their more complex relative the random forest (an ensemble of decision trees), are commonly used as fast and effective classifiers, including as members of ensembles with other algorithms.

We need to add this kind of classifier as one of those that can be used within the soweego environment.

tupini07 commented 4 years ago

sklearn's documentation on decision trees and random forests explains why random forests usually perform better:

.. individual decision trees typically exhibit high variance and tend to overfit. 
The injected randomness in forests yield decision trees with somewhat decoupled 
prediction errors. By taking an average of those predictions, some errors can 
cancel out.

For this reason, I chose to implement only a random forest classifier.
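To illustrate the point in the quote, here is a minimal sketch that compares a single decision tree with a random forest under 5-fold cross-validation. It is not soweego code: the synthetic dataset from `make_classification` is only a stand-in for the real feature vectors, and the sizes and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for real features).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# A single, fully grown tree versus an ensemble of 100 randomized trees.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on each of the 5 held-out folds.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"tree:   mean={tree_scores.mean():.3f} std={tree_scores.std():.3f}")
print(f"forest: mean={forest_scores.mean():.3f} std={forest_scores.std():.3f}")
```

On data like this the forest would typically score higher and vary less across folds, but the exact numbers depend on the dataset, so treat the output as indicative only.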

tupini07 commented 4 years ago

The random forest classifier has been added. I still need to run cross-validation to find the optimal hyperparameters.

tupini07 commented 4 years ago

I started running the nested cross-validation procedure. After doing imdb/producer and musicbrainz/musician, I noticed that the following hyperparameters always take the same values in the best models:

The last hyperparameter we tune in the CV is n_estimators. On average, a value of 350 seems to perform best.

Because of this, and because this grid search procedure is very slow, I've decided to use these as the final hyperparameters for random forest.
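For reference, a nested cross-validation over `n_estimators` could look roughly like the sketch below. This is not the procedure soweego actually runs: the synthetic data, the two-value grid, the fold counts, and the F1 scoring are all assumptions chosen to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic binary data (assumption: stands in for a catalog/entity dataset).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Illustrative grid: only n_estimators is tuned here.
param_grid = {"n_estimators": [100, 350]}

# Inner folds select the hyperparameter, outer folds estimate performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="f1",
)

# Each outer fold reruns the whole grid search on its training part.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
print(f"nested F1: mean={nested_scores.mean():.3f}")

# Fit once on all the data to see which value the grid search prefers.
search.fit(X, y)
best_n_estimators = search.best_params_["n_estimators"]
print(f"selected n_estimators: {best_n_estimators}")
```

Every outer fold repeats the full inner grid search, so the cost grows with the product of grid size and both fold counts. That is why the procedure becomes slow on the real datasets.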