CitrineInformatics / lolo

A random forest
Apache License 2.0

Enforce that all bags have at least one example of each output #281

Closed by bfolie 2 years ago

bfolie commented 2 years ago

If a bag has no rows with a value for a specific output, we end up with a GuessTheMeanLearner with no data, and hence an exception on the empty .head call. This can happen even if we have a reasonable number of training rows for each output. If there are lots of training rows and they all have a value for output1 but only N have a value for output2, then the chance that a given bag contains no values of output2 is about exp(-N). The chance that at least one of B bags is affected is about B * exp(-N). Setting B=64 and N=8, that works out to about 2%, which is too high.
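The estimate above can be checked numerically. This is a minimal sketch, assuming each bag is a standard bootstrap sample of `n_rows` draws with replacement from `n_rows` training rows; the function names and the choice of 10,000 rows are illustrative and not taken from lolo.

```python
import math


def prob_bag_misses_output(n_rows: int, n_with_output: int) -> float:
    """Chance that one bag of n_rows draws (with replacement) contains
    none of the n_with_output rows that have a value for the output."""
    return (1.0 - n_with_output / n_rows) ** n_rows


def prob_any_bag_misses(n_rows: int, n_with_output: int, n_bags: int) -> float:
    """Chance that at least one of n_bags independent bags misses the output."""
    p = prob_bag_misses_output(n_rows, n_with_output)
    return 1.0 - (1.0 - p) ** n_bags


# Assumed sizes: 10,000 training rows, N = 8 rows with output2, B = 64 bags.
n_rows, n_with_output, n_bags = 10_000, 8, 64
exact = prob_any_bag_misses(n_rows, n_with_output, n_bags)
approx = n_bags * math.exp(-n_with_output)  # the B * exp(-N) estimate
print(f"exact: {exact:.4f}, B * exp(-N): {approx:.4f}")
```

Under these assumptions both numbers come out near 0.02, in line with the 2% figure quoted above.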

This PR does two things. First, we require that there be at least 2 training rows with values for each output (if there's just 1 row for some output then it doesn't hit this error but it runs into another problem further down the line). Second, we ensure that each bag has at least 1 value of each output.
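For illustration, the two checks could look roughly like the sketch below. It is written in Python rather than lolo's Scala, it uses simple rejection sampling (redraw a bag until every output is present), and every name in it is hypothetical. It is not the implementation in this PR, which may enforce the constraint differently.

```python
import random
from typing import List, Optional, Sequence

# Hypothetical representation: one label per output, None where missing.
Row = Sequence[Optional[float]]


def check_min_rows_per_output(labels: Sequence[Row], min_rows: int = 2) -> None:
    """Raise if any output has fewer than min_rows training rows with a value."""
    n_outputs = len(labels[0])
    for j in range(n_outputs):
        count = sum(1 for row in labels if row[j] is not None)
        if count < min_rows:
            raise ValueError(
                f"output {j} has {count} labeled rows, need at least {min_rows}"
            )


def draw_valid_bag(
    labels: Sequence[Row], rng: random.Random, max_tries: int = 1000
) -> List[int]:
    """Draw bootstrap row indices, redrawing until every output has
    at least one labeled row in the bag (rejection sampling)."""
    n = len(labels)
    n_outputs = len(labels[0])
    for _ in range(max_tries):
        bag = [rng.randrange(n) for _ in range(n)]
        if all(
            any(labels[i][j] is not None for i in bag) for j in range(n_outputs)
        ):
            return bag
    raise RuntimeError("could not draw a bag covering every output")
```

A caller would run `check_min_rows_per_output(labels)` once up front and then call `draw_valid_bag(labels, rng)` once per bag. Rejection sampling slightly changes the bag distribution compared to plain bootstrapping, which seems an acceptable trade for never training on an empty output.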