andybega / Blair-Sambanis-replication

Replication of 'Forecasting Civil Wars: Theory and Structure in an Age of “Big Data” and Machine Learning' by Blair and Sambanis 2020

Memory allocation errors in RF tuning experiments #1

Closed andybega closed 4 years ago

andybega commented 4 years ago

I've been running tuning experiments that use the original training data to perform cross-validation over different hyperparameter values. I keep getting memory allocation errors for some combinations of hyperparameter values, especially for the CAMEO specification:

Error in { : task 3 failed - "vector memory exhausted (limit reached?)"
Calls: sourceWithProgress -> eval -> eval -> %dopar% -> <Anonymous>
Execution halted

The problem is that randomForest ends up exhausting the available memory. I think I read that one solution is to increase the default maximum vector size in R, but I don't really want to stray into that kind of territory. The other option would be to switch to ranger, which has more efficient memory management (or at least claims to). But then one could argue that any differences in results are due to using ranger instead of randomForest.
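For reference, a minimal sketch of what the ranger alternative might look like, mirroring the randomForest call further down; train_df and escalation are the training data frame and feature set used in the tuning scripts, and the hyperparameter values are illustrative only:

library(ranger)

# ranger uses different argument names: ntree -> num.trees, nodesize -> min.node.size
fitted_mdl <- ranger(y = factor(train_df$incidence_civil_ns_plus1),
                     x = train_df[, escalation],
                     num.trees = 1000,
                     mtry = 3,
                     min.node.size = 1,
                     probability = TRUE)  # probability forest, so predictions are class probabilities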

My impression is that this is partly related to high mtry and high ntree values, but I also suspect that the sample size for each tree plays a role. So far I've only been varying mtry, ntree, and nodesize, not sampsize, and I've been sampling with replacement, in which case sampsize defaults to the number of rows in the training data.
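Roughly, the grid being searched looks like the following sketch; the specific values here are illustrative, not the actual experiment settings:

tune_grid <- expand.grid(mtry     = c(3, 7, 14),
                         ntree    = c(500, 1000, 5000),
                         nodesize = c(1, 5, 10))
# sampsize stays at its default: nrow(train_df) rows drawn with replacement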

The reason for not reducing the per-tree sample size is that in the training-data cross-validation, even when splitting the training data in half, some folds end up with only 1 positive case. Each tree should have at least one positive case in its sample, and drawing full-size samples with replacement all but guarantees that. This is also why B&S need to use regression rather than classification trees: I suspect that almost all of the trees in the forests B&S train end up with only 0 values in the outcome.
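To make the intuition concrete, here is a small sketch of the probability that a single tree's with-replacement sample contains no positive case at all; the row and positive counts are rough figures implied by the 9 positives and the roughly 1-to-1300 prevalence mentioned below, not exact numbers from the data:

p_no_positive <- function(sampsize, N, n_pos) {
  # probability that a with-replacement sample of size `sampsize` from
  # N rows containing n_pos positives has zero positive cases
  (1 - n_pos / N)^sampsize
}

p_no_positive(sampsize = 11700, N = 11700, n_pos = 9)  # ~0.0001 for full-size bootstrap samples
p_no_positive(sampsize = 100,   N = 11700, n_pos = 9)  # ~0.93 for small subsamples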

On the other hand, it is not possible to reliably increase the number of positive cases in the training split by moving to 3- or higher-fold CV: the training split will end up with more positive cases, but sometimes the test fold will end up with 0 positive cases, which makes the AUC undefined.

So far, only a half/half split via 2-fold CV seems to guarantee that both the training split and the validation split contain at least one of the 9 positive cases in the original B&S training data.
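As a quick way to see this, something like the following sketch counts the positives per fold for different k, assuming train_df$incidence_civil_ns_plus1 is the 0/1 outcome:

set.seed(1234)
for (k in c(2, 3, 5)) {
  folds <- sample(rep(1:k, length.out = nrow(train_df)))
  pos_per_fold <- tapply(train_df$incidence_civil_ns_plus1, folds, sum)
  cat("k =", k, "positives per fold:", pos_per_fold, "\n")
}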

Two ways forward, both using stratified sampling on the outcome:

1. Always draw exactly 1 positive case per tree, i.e. fix sampsize[1] = 1 regardless of the split.
2. Set sampsize[1] to the number of positive cases in whatever training split is at hand.

For the first approach, this would basically look like this:

library(randomForest)

# for a classification forest the outcome has to be a factor; otherwise
# randomForest() fits a regression forest (there is no "type" argument)
y_fac <- factor(train_df$incidence_civil_ns_plus1)

fitted_mdl <- randomForest(y = y_fac,
                           x = train_df[, escalation],
                           ntree = 1000,
                           mtry  = 3,
                           nodesize = 1,
                           strata = y_fac,
                           # one sample size per stratum, in the order of the
                           # strata levels; the intent is 1 positive and 1000
                           # negative rows per tree
                           sampsize = c(1, 1000),
                           replace = FALSE,
                           do.trace = FALSE)
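As a usage sketch, probabilities for the positive class on the held-out fold could then be pulled with something like this; test_df stands in for whatever the validation split is called, and "1" is assumed to be the positive class label:

p_hat <- predict(fitted_mdl, newdata = test_df[, escalation], type = "prob")[, "1"]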

Some notes on this:

BTW, the prevalence in the training data is roughly 1 to 1300.

I think the first approach, where sampsize[1] is always 1, is easier. Otherwise there will be a dependency on the number of positives in a particular training split. I don't want to optimize over that.

andybega commented 4 years ago

This seems to be resolved when using stratified sampling and drawing 3,000 or fewer rows per tree.