HenrikBengtsson / future.BatchJobs

:rocket: R package: future.BatchJobs: A Future API for Parallel and Distributed Processing using BatchJobs [Intentionally archived on CRAN on 2021-01-08]
https://cran.r-project.org/package=future.BatchJobs

FOR THE RECORD: When TORQUE/PBS kills a job using too much memory it causes a BatchJobs "expiration" error #57

Closed HenrikBengtsson closed 8 years ago

HenrikBengtsson commented 8 years ago

I'm adding this issue to record what happens when TORQUE/PBS terminates a BatchJobs future due to excessive memory usage:

Some of my BatchJobs futures use more (vmem) memory than requested/allotted from the TORQUE/PBS scheduler. The scheduler therefore terminates the future's R process. This results in the BatchJobs job getting status 'expired', which is translated into the following BatchJobsFutureError in the future.BatchJobs package:

Error in Exception(...) :
  BatchJobExpiration: Job of registry 'BatchJobs_1370412629' expired: /home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs
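
For completeness, this error surfaces when the future's value is collected, so it can be caught like any other R condition. A minimal sketch, assuming a TORQUE/PBS backend is set up; the handler itself is just illustrative:

library("future.BatchJobs")
plan(batchjobs_torque)

f <- future({
  x <- matrix(rnorm(2e9), nrow = 1e5)  ## deliberately memory hungry
  sum(x)
})

v <- tryCatch(value(f), error = function(ex) {
  ## An expired job (e.g. one killed by the scheduler) shows up
  ## here as a BatchJobsFutureError
  message("Future failed: ", conditionMessage(ex))
  NA_real_
})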

The BatchJobs job output log file ends with:

=>> PBS: job killed: vmem 92954931200 exceeded limit 86973087744

This message is generated by TORQUE itself (and sent to stderr), and the R process is terminated via a SIGTERM signal; cf. torque/src/resmom/mom_main.c.
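
Since it is a plain SIGTERM, the failure mode itself can be mimicked locally without a scheduler by signaling a worker process. A sketch for a Unix-like system, using the parallel package rather than BatchJobs (my own illustration, not part of future.BatchJobs):

library("parallel")
library("tools")   ## pskill() and the SIGTERM constant

cl <- makeCluster(1L)                        ## one local R worker
pid <- clusterEvalQ(cl, Sys.getpid())[[1L]]  ## its process ID

pskill(pid, SIGTERM)  ## terminate it the way TORQUE does
Sys.sleep(1)          ## give the signal time to take effect

## Collecting results from the dead worker now fails, analogously to
## the BatchJobs job ending up with status 'expired'
res <- tryCatch(clusterEvalQ(cl, 42), error = identity)
conditionMessage(res)
try(stopCluster(cl), silent = TRUE)  ## the worker is already gone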

Details

The PBS job that was created for the BatchJobs future was effectively submitted as:

qsub -l nodes=1:ppn=23 -l vmem=81gb main_future.pbs

which in turn would spawn off 22 multicore futures on the allotted compute node, as sketched below.
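
For reference, this two-level setup corresponds to a nested plan along the following lines. This is only a sketch: the exact 'resources' field names (nodes, vmem) depend on the site's TORQUE template file, and slow_task() is a hypothetical placeholder:

library("future.BatchJobs")

## Outer level: one TORQUE/PBS job per future.
## Inner level: multicore futures on the compute node allotted to it.
plan(list(
  tweak(batchjobs_torque,
        resources = list(nodes = "1:ppn=23", vmem = "81gb")),
  multicore
))

f <- future({
  ## Runs inside the PBS job; each future() below becomes a forked
  ## multicore process
  fs <- lapply(1:22, function(ii) future(slow_task(ii)))
  lapply(fs, value)
})
v <- value(f)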

cclc01{henrik}: tail 20160429-162151.o366021
Resolving futures ...
Processing time this far:
Time difference of 26.7056 mins
Error in Exception(...) :
  BatchJobExpiration: Job of registry 'BatchJobs_1370412629' expired: /home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs
_1370412629-files [DEBUG INFORMATION: BatchJobsFuture:; Expression:; {; mprintf("Permute across blocks (%d,1)-(%d,%d) ...\n", row,; row, nchrs); data_row <- listenv(); for (jj in seq_along(cols)) {; col <- cols[jj]; mprintf("Block (%d,%d) ...\n", row, col); seed_block <- seeds[[row, col]]; randomSeed("set", seed = seed_block, kind = "L'Ecuyer-CMRG"); seed_block_tag <- sprintf("seed_md5=%s", digest::digest(seed_block)); blockTag <- sprintf("block=%s_vs_%s", chrs[row], chrs[col]); ppTag <- sprintf("p=%d-%d", 1, P); fullname_row_col <- paste(c(blockTag, seed_block_tag,; ppTag), collapse = ","); filename_row_col <- sprintf("%s.rds", fullname_row_col); pathname_row_col <- file.path(pathD, filename_row_col); if (file_test("-f", pathname_row_col)) {; data_row_col <- readRDS(pathname_row_col); data_row[[col]] <- data_row_col; mprin
Calls: as.list ... value.BatchJobsFuture -> NextMethod -> value.Future
Execution halted
Warning message:
In delete.BatchJobsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove BatchJob registry, because the status of the BatchJobs was 'expired', 'started', 'submitted' and option 'future.delete' is not set to FALSE
: '/home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs_1370412629-files'
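
As the warning above notes, the on-disk BatchJobs registry is kept for post-mortem inspection when a future ends up in a state like 'expired'. If that is not wanted, my understanding is that cleanup can be forced via the R option named in the warning; a one-line sketch, assuming that TRUE forces deletion also for failed futures:

## Assumption: TRUE removes BatchJobs registries even for futures
## that failed or expired (see the warning above)
options(future.delete = TRUE)

Keeping the registry, on the other hand, is what makes the following post-mortem possible: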
cclc01{henrik}: tail /home/henrik/foo/.future/20160429_162220-kTLBcT/BatchJobs_1370412629-files/jobs/01/1.out
  ..- attr(*, "dim")= int [1:2] 24319 13385
  ..- attr(*, "dimnames")=List of 2
  .. ..$ chr2 : chr [1:24319] "chr2_0kb" "chr2_10kb" "chr2_20kb" "chr2_30kb" ...
  .. ..$ chr12: chr [1:13385] "chr12_0kb" "chr12_10kb" "chr12_20kb" "chr12_30kb" ...
  ..- attr(*, "class")= chr "RaoMatrix"
  ..- attr(*, "dimSwap")= logi FALSE
  ..- attr(*, "chromosomes")= chr [1:2] "2" "12"
  ..- attr(*, "resolution")= int 10000
 $ y   : num [1:13385] NaN NaN NaN NaN NaN ...
=>> PBS: job killed: vmem 92954931200 exceeded limit 86973087744

From the last line we see that the cluster scheduler (PBS) terminated this R process because it consumed 93.0 GB (= 86.6 GiB) of virtual memory, whereas it had only requested 87.0 GB (= 81.0 GiB).
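
For the record, the unit conversions (decimal GB versus binary GiB) behind those numbers:

used  <- 92954931200  ## bytes, from the PBS kill message
limit <- 86973087744  ## bytes; exactly 81 * 1024^3, i.e. the requested vmem=81gb

used  / 1e9     ## 92.95 GB used
used  / 1024^3  ## 86.57 GiB used
limit / 1e9     ## 86.97 GB limit
limit / 1024^3  ## 81.00 GiB limit

Note that TORQUE interprets the 'gb' suffix in binary units, which is why vmem=81gb becomes a limit of exactly 86973087744 bytes.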