Closed: jedaniels-ucd closed this issue 5 years ago.
Which algorithms did you include in the SL.library? In your CV.SuperLearner
call, did you set the value for the parallel argument or use the default?
Algorithms appear to include: SL.library <- list("SL.mean", "SL.gbmmini2", "SL.nnet", "SL.glmnet", "SL.bayesglm", "SL.xgboost", "SL.gbmmini3")
CV.SuperLearner is using parallel = "multicore", although the same error occurs whether or not parallel is specified.
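For reference, a minimal sketch of the call as described (Y, X, and the binomial family are placeholders here; the gbmmini wrappers are the PI's custom code and are not shown):

```r
library(SuperLearner)
options(mc.cores = 4L)  # intended cap on multicore workers

SL.library <- list("SL.mean", "SL.gbmmini2", "SL.nnet", "SL.glmnet",
                   "SL.bayesglm", "SL.xgboost", "SL.gbmmini3")

cv_fit <- CV.SuperLearner(Y = Y, X = X, family = binomial(),
                          SL.library = SL.library,
                          parallel = "multicore")
```

With parallel = "multicore", CV.SuperLearner dispatches folds via parallel::mclapply, which takes its worker count from getOption("mc.cores").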
[snip]
I suspect it is one of the algorithms in the library causing the problem. If you remove the gbm-based algorithms (gbmmini[2|3] and xgboost), does it still give the error?
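A quick way to run that test is to rebuild the library without them and rerun; a minimal sketch (Y, X, and the family are placeholders, as above):

```r
# Reduced library for fault isolation: gbmmini2/3 and xgboost removed.
SL.library.reduced <- list("SL.mean", "SL.nnet", "SL.glmnet", "SL.bayesglm")

cv_fit_reduced <- CV.SuperLearner(Y = Y, X = X, family = binomial(),
                                  SL.library = SL.library.reduced,
                                  parallel = "multicore")
```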
Removing gbm* and xgboost does stabilize behavior. From a substantive point of view, I am not sure of the consequences of removing these algorithms. Any suggestions for salvaging them if it becomes necessary? The one job that ran successfully might just have been a lucky seed or something.
Addendum: I just noticed that gbmmini2/3 are custom algorithms, so this sounds more like the PI's problem than an actual SuperLearner bug.
Might need to look at the content of the gbmmini2/3 algorithms to see if they are spawning multiple jobs and whether that can be restricted (you can test whether they are in fact the issue by adding back just xgboost). It does sound like one of the algorithms is causing the problem, not the SuperLearner code. In that case, you may be able to modify the custom algorithm to control it, as in the sketch below.
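For what it's worth, gbm::gbm() spawns its own worker cluster for cross-validated tree selection whenever cv.folds > 1 and n.cores is left NULL (it then guesses the core count itself), which nests badly inside a multicore CV.SuperLearner. Below is a minimal sketch of a gbm wrapper that pins that internal parallelism; it is not the actual gbmmini2/3 code (which isn't posted here), and the tuning values are placeholders:

```r
# Hypothetical single-threaded gbm wrapper; tuning values are placeholders.
SL.gbm.serial <- function(Y, X, newX, family, obsWeights,
                          gbm.trees = 1000, interaction.depth = 2, ...) {
  require("gbm")
  gbm.model <- as.formula(paste("Y ~", paste(colnames(X), collapse = " + ")))
  fit.gbm <- gbm::gbm(formula = gbm.model, data = X,
                      distribution = ifelse(family$family == "gaussian",
                                            "gaussian", "bernoulli"),
                      n.trees = gbm.trees,
                      interaction.depth = interaction.depth,
                      cv.folds = 5, weights = obsWeights, verbose = FALSE,
                      n.cores = 1)  # the key change: no nested workers
  best.iter <- gbm::gbm.perf(fit.gbm, method = "cv", plot.it = FALSE)
  pred <- predict(fit.gbm, newdata = newX, n.trees = best.iter,
                  type = "response")
  fit <- list(object = fit.gbm, n.trees = best.iter)
  class(fit) <- "SL.gbm"  # reuse SuperLearner's predict.SL.gbm
  list(pred = pred, fit = fit)
}
```

The analogous knob for xgboost is SL.xgboost's nthread argument, which defaults to 1, so it is worth checking whether the call raises it.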
Recurring problem where SuperLearner (the CV.SuperLearner function specifically), no matter how many mc.cores are specified, continues to spawn workers until the entire system dies under the weight of the jobs. Replicated this on several builds using different Linux distributions (Red Hat, Amazon Linux, Ubuntu). Was able to accidentally build one machine that works properly; the rest all failed. Manually creating, running, and stopping workers interactively in R works fine.
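The interactive check mentioned above might look like this minimal sketch using base R's parallel package:

```r
library(parallel)
cl <- makeCluster(4)                        # spawn exactly 4 workers
res <- parLapply(cl, 1:8, function(i) i^2)  # runs fine, no extra processes
stopCluster(cl)                             # workers exit cleanly
```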