CitrineInformatics / lolo

A random forest
Apache License 2.0
41 stars, 12 forks

Add `minDistinctLabels` to decision tree to prevent UQ collapse in Bagger #197

Open maxhutch opened 4 years ago

maxhutch commented 4 years ago

If the training labels contain repeated values, it becomes increasingly likely that every tree in the ensemble makes the same prediction (even when the input values differ), which leaves the Bagger with no spread to report as uncertainty. This could be prevented by requiring a minimum number of distinct label values in each leaf of the decision trees. That would significantly increase the likelihood that different trees hold different sets of label values in the leaf that produces a prediction, so the trees would make different predictions and the ensemble would report some predictive uncertainty.
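To make the proposal concrete, here is a minimal sketch of the check a tree learner could run before accepting a split. It is written in Python for illustration only and is not lolo's code or API: `split_allowed` and `count_distinct` are hypothetical names, and `min_distinct_labels` stands in for the proposed `minDistinctLabels` parameter. It also assumes labels are compared by exact equality.

```python
from typing import Sequence


def count_distinct(labels: Sequence[float]) -> int:
    """Number of distinct label values (exact equality; an assumption)."""
    return len(set(labels))


def split_allowed(
    left_labels: Sequence[float],
    right_labels: Sequence[float],
    min_distinct_labels: int = 2,
) -> bool:
    """Accept a candidate split only if both children keep enough distinct labels.

    Rejecting the split leaves the parent node as a leaf, so every leaf
    ends up with at least `min_distinct_labels` distinct label values
    (provided the root itself satisfies the constraint).
    """
    return (
        count_distinct(left_labels) >= min_distinct_labels
        and count_distinct(right_labels) >= min_distinct_labels
    )


# The left child would hold only the repeated value 1.0, so the split is rejected.
rejected = split_allowed([1.0, 1.0], [3.0, 4.0])
# Both children hold two distinct values, so the split is accepted.
accepted = split_allowed([1.0, 1.0, 2.0], [3.0, 4.0])
```

With the constraint in place, a leaf can never consist of a single repeated label, so bootstrap resampling has a chance to change the leaf mean from tree to tree.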

cc: @bfolie

maxhutch commented 4 years ago

An alternative: simply set a predicted uncertainty floor that depends on the variance of the training labels and the number of training rows.
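The comment does not say what form the floor should take. As one possible reading, the sketch below uses the standard error of the mean of the training labels, sqrt(variance / n), which depends on exactly the two quantities mentioned. That formula is an assumption made for illustration, as are the names `uncertainty_floor` and `apply_floor`; none of this is lolo's implementation.

```python
import math
from statistics import pvariance
from typing import Sequence


def uncertainty_floor(training_labels: Sequence[float]) -> float:
    """Assumed floor: sqrt(population variance of labels / number of rows)."""
    n = len(training_labels)
    if n == 0:
        raise ValueError("need at least one training label")
    return math.sqrt(pvariance(training_labels) / n)


def apply_floor(predicted_std: float, floor: float) -> float:
    """Never report less uncertainty than the floor."""
    return max(predicted_std, floor)


labels = [1.0, 1.0, 3.0, 3.0]          # repeated label values
floor = uncertainty_floor(labels)       # variance 1.0, n = 4
collapsed = apply_floor(0.0, floor)     # ensemble agreed exactly; the floor takes over
unchanged = apply_floor(0.8, floor)     # already above the floor; left as-is
```

Unlike the leaf constraint, this leaves tree construction untouched and only post-processes the reported uncertainty, at the cost of picking a functional form for the floor.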