I've been running tuning experiments that use the original training data to perform cross-validation over different hyperparameter values. I keep getting memory allocation errors when trying some combinations of hyperparameter values, especially for the CAMEO specification.
The problem is that randomForest ends up exhausting the available memory. I think I read that one solution is to increase R's default maximum vector size, but I don't really want to stray into that kind of territory. The other option would be to use ranger, which has more efficient memory management (or at least so it claims). But then one could argue that any differences in results are due to using ranger instead of randomForest.
My impression is that this is partly related to high mtry and high ntree values, but I also suspect that the sample size for each tree plays a role. So far I've been sampling with replacement at a size equal to the number of training data rows, and only varying mtry, ntree, and nodesize, but not sampsize. With replacement, sampsize defaults to the number of rows in the training data.
The reason for not changing the sampling scheme and making the per-tree sample smaller is that in the training-data cross-validation, even when splitting the training data in half, some folds end up with only 1 positive case. Each tree should have at least one positive case in its data sample, and sampling training data rows with replacement at full size at least makes that likely. This is why B&S need to use regression, not classification, trees: I suspect that almost all of the trees in the forests that B&S train end up with only 0 values in the outcome.
On the other hand, it is not possible to reliably increase the number of positive cases in the training split by moving to 3- or higher-fold CV: while the training split will then end up with more positive cases, the test fold will sometimes end up with 0 positive cases. This makes AUC undefined.
So far only doing a half/half split via 2-fold CV seems to guarantee that both the training split and validation split have at least one of the 9 positive cases in the original B&S training split.
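To make the undefined-AUC problem concrete, here is a minimal base-R sketch; the `auc` helper is hypothetical, written just for illustration using the rank-comparison formula:

```r
# AUC via pairwise comparisons: the fraction of (positive, negative) pairs
# where the positive case gets the higher score (ties count as 0.5).
auc <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # With no positives (or no negatives) in the fold there are no pairs
  # to compare, so AUC is undefined.
  if (length(pos) == 0 || length(neg) == 0) return(NA_real_)
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc(c(0.9, 0.2, 0.4), c(1, 0, 0))  # well-defined: 1
auc(c(0.9, 0.2, 0.4), c(0, 0, 0))  # NA: no positive cases in the fold
```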
Two ways forward:
In the parallel loop in run-tune-experiments.R, catch errors (tryCatch) so that the worker can move on to the next task. Any hyperparameter set that produces an error can thus later just be invalidated.
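A minimal sketch of what that could look like; the function and parameter names here are hypothetical stand-ins, not the actual contents of run-tune-experiments.R:

```r
# Wrap each fit in tryCatch() so that one failing hyperparameter
# combination returns NA instead of killing the worker.
fit_one <- function(params) {
  tryCatch(
    {
      # Stand-in for the real cross-validated fit; a large ntree here
      # simulates the memory allocation error seen with randomForest.
      if (params$ntree > 1000) stop("cannot allocate vector of size ...")
      list(auc = 0.5, error = NA_character_)
    },
    error = function(e) list(auc = NA_real_, error = conditionMessage(e))
  )
}

grid <- list(list(ntree = 500), list(ntree = 5000))
results <- lapply(grid, fit_one)  # the real script runs this loop in parallel
```

Failed combinations then show up as NA rows in the results and can be filtered out afterwards.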
In RF, try stratified sampling by the outcome so that sampsize can also become a tunable parameter while still ensuring that each data sample will contain at least one positive case.
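This would basically look like this (a sketch on toy data; the real feature specification and tuned values differ):

```r
library(randomForest)

# Toy data: a heavily imbalanced binary outcome with 3 positive cases.
set.seed(1)
n <- 2000
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(c(rep("1", 3), rep("0", n - 3)), levels = c("0", "1"))

rf <- randomForest(
  x, y,
  ntree    = 100,
  strata   = y,                      # stratify each tree's sample by the outcome
  sampsize = c("0" = 500, "1" = 1),  # per-stratum sample sizes, in level order
  replace  = FALSE                   # stratified sampling seems to require this
)
```

In the real run, the "0" entry of sampsize would be the tunable value, with the "1" entry guaranteeing at least one positive case per tree.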
The sampsize vector defines how many samples to take from the "1" and "0" cases.
RF seems to only allow sampling without replacement when using stratified sampling.
As a result, in sampsize, we cannot sample more "1" cases than there are in any particular CV train data split. Sometimes this value will be 1, sometimes more.
So, either hold sampsize[1] constant at 1 and vary sampsize[2], or allow sampsize[1] to vary up to the number of positives in a train split and vary sampsize[2] as well.
BTW, the prevalence in the training data is roughly 1 to 1300.
I think the first approach, where sampsize[1] is always 1, is easier. Otherwise there will be a dependency on the number of positives in a particular training split. I don't want to optimize over that.