:rocket: R package: future.BatchJobs: A Future API for Parallel and Distributed Processing using BatchJobs [Intentionally archived on CRAN on 2021-01-08]
I'm adding this issue to record what is happening when TORQUE/PBS terminates a BatchJobs future due to too much memory usage:
Some of my BatchJobs futures use more (vmem) memory than requested/allotted by the TORQUE/PBS scheduler. The scheduler therefore terminates the future's process. This results in the BatchJobs job getting status 'expired', which is translated into the following BatchJobsFutureError in the future.BatchJobs package:
Error in Exception(...) :
BatchJobExpiration: Job of registry 'BatchJobs_1370412629' expired: /home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs
The "=>> PBS: job killed" message (shown in the job output log below) is generated by TORQUE itself (sent to stderr), and the termination of the R process is done via a SIGTERM signal, cf. torque/src/resmom/mom_main.c.
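Since an expired job surfaces as a regular R error when the future's value is collected, it can be trapped with tryCatch(). A minimal sketch of the pattern, using the plain future package (which implements the same Future API) with a simulated failure instead of a real TORQUE cluster; the error message is made up for illustration:

```r
library(future)   # future.BatchJobs implements this same Future API
plan(sequential)  # on a cluster one would use plan(batchjobs_torque)

## A future whose job "expires": a scheduler kill surfaces the same
## way, i.e. as an error signaled when value() is called.
f <- future(stop("job expired (simulated)"))

res <- tryCatch(value(f), error = function(e) conditionMessage(e))
print(res)
```

The same tryCatch() wrapper around value() would catch the BatchJobsFutureError described above.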
Details
The PBS job that was created for the BatchJobs future was effectively submitted as:
which in turn would spawn 22 multicore futures on the allotted compute node.
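The two-level setup described above can be sketched with a nested plan. This is illustrative only; it assumes the future.BatchJobs package and a TORQUE/PBS cluster, and the worker count of 22 comes from the job's allotment:

```r
library(future.BatchJobs)

## Outer level: each future becomes one TORQUE/PBS job;
## inner level: up to 22 multicore futures on the allotted node.
plan(list(
  batchjobs_torque,
  tweak(multicore, workers = 22L)
))
```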
cclc01{henrik}: tail 20160429-162151.o366021
Resolving futures ...
Processing time this far:
Time difference of 26.7056 mins
Error in Exception(...) :
BatchJobExpiration: Job of registry 'BatchJobs_1370412629' expired: /home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs_1370412629-files [DEBUG INFORMATION: BatchJobsFuture:
Expression:
{
    mprintf("Permute across blocks (%d,1)-(%d,%d) ...\n", row,
        row, nchrs)
    data_row <- listenv()
    for (jj in seq_along(cols)) {
        col <- cols[jj]
        mprintf("Block (%d,%d) ...\n", row, col)
        seed_block <- seeds[[row, col]]
        randomSeed("set", seed = seed_block, kind = "L'Ecuyer-CMRG")
        seed_block_tag <- sprintf("seed_md5=%s", digest::digest(seed_block))
        blockTag <- sprintf("block=%s_vs_%s", chrs[row], chrs[col])
        ppTag <- sprintf("p=%d-%d", 1, P)
        fullname_row_col <- paste(c(blockTag, seed_block_tag,
            ppTag), collapse = ",")
        filename_row_col <- sprintf("%s.rds", fullname_row_col)
        pathname_row_col <- file.path(pathD, filename_row_col)
        if (file_test("-f", pathname_row_col)) {
            data_row_col <- readRDS(pathname_row_col)
            data_row[[col]] <- data_row_col
            mprin
Calls: as.list ... value.BatchJobsFuture -> NextMethod -> value.Future
Execution halted
Warning message:
In delete.BatchJobsFuture(future, onRunning = "skip", onMissing = "ignore", :
Will not remove BatchJob registry, because the status of the BatchJobs was 'expired', 'started', 'submitted' and option 'future.delete' is not set to FALSE
: '/home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs_1370412629-files'
The BatchJobs job output log file ends with:
cclc01{henrik}: tail /home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs_1370412629-files/jobs/01/1.out
..- attr(*, "dim.")= int [1:2] 24319 13385
..- attr(*, "dimnames.")=List of 2
.. ..$ chr2 : chr [1:24319] "chr2_0kb" "chr2_10kb" "chr2_20kb" "chr2_30kb" ...
.. ..$ chr12: chr [1:13385] "chr12_0kb" "chr12_10kb" "chr12_20kb" "chr12_30kb" ...
..- attr(*, "class")= chr "RaoMatrix"
..- attr(*, "dimSwap")= logi FALSE
..- attr(*, "chromosomes")= chr [1:2] "2" "12"
..- attr(*, "resolution")= int 10000
$ y : num [1:13385] NaN NaN NaN NaN NaN ...
=>> PBS: job killed: vmem 92954931200 exceeded limit 86973087744
Here we see that the cluster scheduler (PBS) terminated this R process because it consumed 93.0 GB (= 86.6 GiB) of virtual memory (vmem) whereas the job had only requested 87.0 GB (= 81.0 GiB).
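As a quick sanity check, the GB/GiB figures follow from the byte counts in the PBS message (using 1 GB = 10^9 bytes and 1 GiB = 2^30 bytes):

```r
used  <- 92954931200  # vmem used, from the PBS kill message (bytes)
limit <- 86973087744  # vmem limit (bytes)

gb  <- function(b) round(b / 1e9, 1)   # decimal gigabytes
gib <- function(b) round(b / 2^30, 1)  # binary gibibytes

c(used_GB = gb(used), used_GiB = gib(used))      # 93.0 GB = 86.6 GiB
c(limit_GB = gb(limit), limit_GiB = gib(limit))  # 87.0 GB = 81.0 GiB
```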