Closed tupini07 closed 4 years ago
sklearn's documentation on decision trees and random forests explains why random forests are usually better:
.. individual decision trees typically exhibit high variance and tend to overfit.
The injected randomness in forests yield decision trees with somewhat decoupled
prediction errors. By taking an average of those predictions, some errors can
cancel out.
Because of this, we chose to implement only a random forest classifier.
The random forest classifier has been added. We still need to run cross-validation to find the optimal hyperparameters.
I started running the nested cross-validation procedure, and after doing imdb/producer and musicbrainz/musician I noticed that the following hyperparameters are always set as follows in the best models:
bootstrap=True
max_features=auto
The last hyperparameter we tune in the CV is n_estimators. On average, a value of 350 seems to perform best.
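For reference, a nested cross-validation run of this kind can be sketched roughly as below. This is only an illustration, not the soweego code: the synthetic data, the tiny grid, the fold counts and the F1 scoring are all assumptions chosen so that the snippet runs quickly. Since max_features="auto" is no longer accepted by recent sklearn releases, the sketch uses "sqrt", which is what "auto" meant for classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption); the real feature vectors come from the project pipeline.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid, kept deliberately small; the grid actually searched is not shown in this thread.
param_grid = {
    "n_estimators": [10, 30],
    "bootstrap": [True, False],
    "max_features": ["sqrt", None],
}

# Inner folds pick the hyperparameters, outer folds score the whole "tune, then fit" procedure.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="f1",
)

# return_estimator=True keeps the fitted search object of every outer fold,
# so the winning hyperparameters can be compared across folds.
results = cross_validate(
    search, X, y, cv=outer_cv, scoring="f1", return_estimator=True
)

outer_scores = results["test_score"]
best_per_fold = [fitted.best_params_ for fitted in results["estimator"]]

for score, params in zip(outer_scores, best_per_fold):
    print(f"outer F1={score:.3f} best={params}")
```

Comparing best_per_fold across outer folds is one way to spot hyperparameters that always end up with the same value, as described above.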
Because of this, and because this grid search procedure is very slow, I've decided to use these as the final hyperparameters for random forest.
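A minimal sketch of a classifier built with these final values, assuming plain sklearn and synthetic data (the actual soweego wrapper is not shown here). Note that max_features="auto" was deprecated in sklearn 1.1 and removed in 1.3; for classifiers it was equivalent to "sqrt", so the sketch passes "sqrt" to stay runnable on current versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), only here to make the snippet runnable.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=350,     # value that performed best on average in the CV runs
    bootstrap=True,       # always selected in the best models
    max_features="sqrt",  # what "auto" meant for classifiers in older sklearn
    random_state=0,       # fixed seed, added only for reproducibility of the sketch
)
clf.fit(X, y)

# Class probabilities for the first five samples, one column per class.
probabilities = clf.predict_proba(X[:5])
print(probabilities.shape)
```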
Decision trees, or their more complex version, random forests (which are just ensembles of decision trees), are commonly used as fast and effective classifiers in ensembles of other algorithms.
We need to add this kind of classifier as one of those that can be used within the soweego environment.