If a bag has no rows with a value for a specific output, that will lead to a `GuessTheMeanLearner` with no data and hence an exception on the empty `.head` call. This can happen even if we have a reasonable number of training rows for each output. If there are lots of training rows and they all have a value for `output1` but only `N` have a value for `output2`, then the chance of creating a bag without any values of `output2` is about `exp(-N)`. For `B` bags, the chance that at least one bag has no values of `output2` is about `B * exp(-N)`. Setting `B=64` and `N=8`, that works out to about 2%, which is too high.
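As a quick check of the arithmetic above, here is a small Python sketch. It assumes each bag is a bootstrap sample drawn with replacement and of the same size as the training set, which is what makes the per-bag miss probability approximately `exp(-N)`. The function names and the `total_rows` parameter are hypothetical and are not part of this PR.

```python
import math


def approx_failure_probability(n_rows_with_value: int, n_bags: int) -> float:
    """Estimate B * exp(-N) from the description, capped at 1."""
    return min(1.0, n_bags * math.exp(-n_rows_with_value))


def bootstrap_failure_probability(total_rows: int, n_rows_with_value: int, n_bags: int) -> float:
    """Same quantity without the exp(-N) approximation.

    Assumes each bag draws `total_rows` rows with replacement, so one bag
    misses every row that has a value with probability (1 - N/M)^M.
    """
    p_one_bag_misses = (1.0 - n_rows_with_value / total_rows) ** total_rows
    return 1.0 - (1.0 - p_one_bag_misses) ** n_bags


if __name__ == "__main__":
    # B=64 bags, N=8 rows with a value for the sparse output.
    print(f"{approx_failure_probability(8, 64):.4f}")
    # Same setting with a hypothetical training set of 1000 rows.
    print(f"{bootstrap_failure_probability(1000, 8, 64):.4f}")
```

Both estimates land near 2% for `B=64` and `N=8`, and the probability falls off quickly as `N` grows.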
This PR does two things. First, we require that there be at least 2 training rows with values for each output (if there's just 1 row for some output then it doesn't hit this error but it runs into another problem further down the line). Second, we ensure that each bag has at least 1 value of each output.
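The two checks can be illustrated with a minimal Python sketch. This is not the code in this PR: it uses simple rejection sampling (redraw a bag until it covers every output) as one possible way to get the guarantee, and it assumes bootstrap bags of the same size as the training set. All names here (`check_min_rows_per_output`, `bag_covers_all_outputs`, `draw_bags`, `max_tries`) are hypothetical.

```python
import random
from typing import List, Optional, Sequence

# One row of labels: one entry per output, None where the value is missing.
Row = Sequence[Optional[float]]


def check_min_rows_per_output(rows: Sequence[Row], min_rows: int = 2) -> None:
    """Raise if any output has fewer than `min_rows` rows with a value."""
    if not rows:
        raise ValueError("no training rows")
    for j in range(len(rows[0])):
        count = sum(1 for row in rows if row[j] is not None)
        if count < min_rows:
            raise ValueError(f"output {j} has only {count} row(s) with a value")


def bag_covers_all_outputs(rows: Sequence[Row], indices: Sequence[int]) -> bool:
    """True if the bag contains at least one value for every output."""
    return all(
        any(rows[i][j] is not None for i in indices)
        for j in range(len(rows[0]))
    )


def draw_bags(rows: Sequence[Row], n_bags: int, rng: random.Random,
              max_tries: int = 1000) -> List[List[int]]:
    """Draw bootstrap bags, redrawing any bag that misses an output."""
    check_min_rows_per_output(rows)
    n = len(rows)
    bags: List[List[int]] = []
    for _ in range(n_bags):
        for _ in range(max_tries):
            indices = [rng.randrange(n) for _ in range(n)]
            if bag_covers_all_outputs(rows, indices):
                bags.append(indices)
                break
        else:
            raise RuntimeError("could not draw a bag covering every output")
    return bags
```

For example, with 50 rows that all have a value for the first output and only 2 that have a value for the second, `draw_bags(rows, 64, random.Random(0))` returns 64 bags that each include at least one of those 2 rows, while a data set with only 1 such row is rejected up front by `check_min_rows_per_output`.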