Closed: rimorob closed this issue 4 years ago.
Re space usage: I got to the bottom of that. It's the 0.5GB export folder repeated 40 times. I think I've mentioned in another thread that the large size of the exports, which are shared among the repeats, makes for very high disk usage. But that's not what's crashing the program. It seems to be trying to delete a file, possibly a file that doesn't exist. Might be a simple bug to fix?
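If that's the case, a defensive wrapper along these lines might sidestep it (a minimal sketch of the idea only; safe_delete is a hypothetical helper, not batchtools code, and it assumes the crash really is triggered by handing fs a path that disappears between the existence check and the delete):

    library(fs)

    # Hypothetical guard: the check-then-delete in the trace is not atomic,
    # so a file removed in between (e.g. by one of the 40 repeats sharing
    # the export folder) can still reach the native deletion code. Catching
    # the error in R keeps it from going any further.
    safe_delete <- function(paths) {
      for (p in paths) {
        tryCatch({
          if (file_exists(p)) file_delete(p)
        }, error = function(e) {
          warning("could not delete ", p, ": ", conditionMessage(e))
        })
      }
      invisible(paths)
    }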
Whatever it is, I'm 99.9999% certain it's nothing in the future framework per se. A segfault indicates a bug in native code, and the future framework is all R code.
Somewhat related: I've now documented the option future.delete, cf. commit d3d4f7e.
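For anyone finding this later, it is set like any regular R option (a minimal sketch; the exact semantics are whatever that commit documents, assumed here to control removal of the per-future batchtools registry folder):

    # Assumed behavior: delete the underlying batchtools registry folder
    # once the future has been resolved, so .future/ does not keep growing.
    options(future.delete = TRUE)

    # The trace below also shows it being passed per foreach() chunk:
    #   .options.future = list(future.delete = TRUE)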
Also, I see that you've moved this to https://github.com/mllg/batchtools/issues/266, so closing here.
I'm getting a segfault when running doFuture; the problem seems to be in the future.batchtools backend. Here's the stack trace:

    *** caught segfault ***
    address 0x7ffe38150ff8, cause 'memory not mapped'
Traceback:

    1: dir_map(old, identity, all, recurse, type, fail)
    2: dir_ls(old, type = "directory", recurse = TRUE, all = TRUE)
    3: dir_delete(old[dirs])
    4: fs::file_delete(x[fs::file_exists(x)])
    5: file_remove(file)
    6: (function (object, file, compress = "gzip") { file_remove(file) saveRDS(object, file = file, version = 2L, compress = compress) waitForFile(file, 300) invisible(TRUE)})(object = dots[[1L]][[23L]], file = dots[[2L]][[23L]], compress = dots[[3L]][[1L]])
    7: mapply(FUN = f, ..., SIMPLIFY = FALSE)
    8: Map(writeRDS, object = export, file = fn, compress = reg$compress)
    9: batchExport(export = future$globals, reg = reg)
    10: run.BatchtoolsFuture(future)
    11: run(future)
    12: batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals,
            label = label, template = template, type = "slurm", resources = resources,
            workers = workers, registry = registry, ...)
    13: makeFuture(...)
    14: .makeFuture(expr, substitute = FALSE, envir = envir, globals = globals,
            packages = packages, seed = seed, lazy = lazy, ...)
    15: future(expr, substitute = FALSE, envir = envir, globals = globals_ii,
            packages = packages_ii, seed = seed, stdout = stdout,
            conditions = conditions, label = labels[ii])
    16: e$fun(obj, substitute(ex), parent.frame(), e$data)
    17: foreach(si = 1:self$nBoot,
                .options.future = list(scheduling = 5, future.delete = TRUE),
                .packages = c("tidyverse", "glinternet", "R6"),
                .export = c("self", "as.glm.glinternet.cv")) %dopar% {
            set.seed(si)
            if (self$debug) {
                sink("log.txt", append = TRUE)
                print(paste("job", si, "at time", date()))
                sink()
            }
            sIdx = sample(x = 1:nrow(X), size = round(nrow(X) * trainFraction), replace = FALSE)
            maxVars = min(floor(trainFraction * nrow(X)) - 1, ncol(X))
            pbVec = aggWeights(self$featureWeights)
            if (trainFraction < 1) {
                varIdx = sample(x = 1:ncol(X), size = maxVars, prob = pbVec)
                trainX = as.matrix(X[sIdx, varIdx])
                testX = as.matrix(X[-sIdx, varIdx])
                trainY = Y[sIdx, , drop = F]
                testY = Y[-sIdx, , drop = F]
                if (self$randomize) {
                    trainY = trainY %>% sample
                }
                if (self$debug) {
                    sink("log.txt", append = TRUE)
                    print("pre-glinternet")
                    sink()
                }
            } else {
                trainX = X[, self$bootColNames[[si]]]
                trainY = Y
            }
            nLevels = numLevels(trainX)
            status = tryCatch({
                cvModel = glinternet.cv(trainX, trainY, numLevels = nLevels,
                    nFolds = self$nFolds, family = self$family, numCores = 1)
            }, error = function(e) {
                return(-1)
            }, { })
            if (class(status) != "glinternet.cv") {
                print("cv model crashed; retrying once")
                cvModel = glinternet.cv(trainX, trainY, numLevels = nLevels,
                    nFolds = self$nFolds, family = self$family, numCores = 1)
            }
            if (self$debug) {
                sink("log.txt", append = TRUE)
                print("post-cast")
                sink()
            }
            if (trainFraction < 1) {
                print("fitting a glm")
                glmModel = as.glm(cvModel, testX, testY, rebuildInteractions = FALSE,
                    simplify = FALSE, k = log(nrow(trainX)))
                print("predicting out of sample")
                predOos = predict(cvModel, X = as.matrix(testX),
                    lambdaType = "lambdaHat1Std", type = "response")
                print("calculating r2")
                mRsq = do.call("glm", list(testY ~ predOos,
                    data = data.frame(testY, predOos), family = self$family))
                rsq = nagelkerke(mRsq)$Pseudo.R.squared.for.model.vs.null["Cox and Snell (ML)", ]
                weight = 1/(1 - rsq)
            } else {
                predOos = NA
                weight = NA
                glmModel = as.glm(cvModel, trainX, trainY, rebuildInteractions = FALSE,
                    simplify = FALSE, k = log(nrow(trainX)))
            }
            print("here")
            fname = tempfile(pattern = "file",
                tmpdir = "/fsx/home/bhayete/Projects/EMP-prioritization/tmp",
                fileext = paste(".", si, ".RData", sep = ""))
            save(list = ls(), file = fname)
            print("there")
            print(class(glmModel))
            print("---")
            featureWeights = private$calcFeatureWeights(glmModel, simplify = FALSE)
            rm(fname)
            return(list(mGli = cvModel, mGlm = glmModel, weight = weight,
                bootColNames = colnames(trainX), featureWeights = featureWeights))
        }
    18: private$bootstrapGlinternetIteration(trainFraction = trainFraction)
^^^ This last frame (18) is my own function, which contains the foreach loop.
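For readability, frame 6 of the trace is batchtools' internal writeRDS helper; here it is again with only indentation added, the body taken verbatim from the trace. The segfault happens on its first line, where file_remove() funnels into fs::file_delete() and ultimately fs::dir_ls() (frames 5 down to 1):

    # Body as it appears in frame 6; file_remove() and waitForFile()
    # are batchtools internals visible in the trace.
    writeRDS <- function(object, file, compress = "gzip") {
      file_remove(file)  # frames 5-1: crashes inside fs while deleting
      saveRDS(object, file = file, version = 2L, compress = compress)
      waitForFile(file, 300)
      invisible(TRUE)
    }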
Other noteworthy things: the machine has 128GB RAM and is lightly used except by me, and the remote workers also have plenty of available memory. In any case, the segfault seems to happen locally, during a file storage or removal operation. I can provide other information if it would be helpful. For instance, the final run of doFuture left a number of files in the .future directory amounting to 19GB. I'm not sure how it gets to that number, since the data frame I'm passing around is about 1000x1000.
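For scale, a quick back-of-the-envelope check (hypothetical numbers, not from the actual run) suggests the data frame itself is tiny, so the 19GB presumably comes from the exported globals being written once per future, consistent with the 0.5GB-export-folder-times-40-repeats explanation above:

    # A 1000x1000 numeric data frame is only about 8 MB in memory
    # (1e6 doubles at 8 bytes each, plus a little overhead):
    df <- as.data.frame(matrix(rnorm(1000 * 1000), nrow = 1000))
    format(object.size(df), units = "MB")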