HenrikBengtsson / future.BatchJobs

R package: future.BatchJobs: A Future API for Parallel and Distributed Processing using BatchJobs [Intentionally archived on CRAN on 2021-01-08]
https://cran.r-project.org/package=future.BatchJobs

TROUBLESHOOTING: Keep job registry files to simplify troubleshooting #56

Closed: HenrikBengtsson closed this issue 8 years ago

HenrikBengtsson commented 8 years ago

Background

When running a script from the command line (e.g. Rscript foo.R) that has one or more BatchJobs futures that fail due to expiration, there is little information to go by after the main R session terminates (because futures are automagically cleaned up). All we might have is something like the following output:

Error in Exception(...) :
  BatchJobExpiration: Job of registry 'BatchJobs_2014070738' expired: /home/henrik/projects/foo/.future/20160424_205256-Nd6Fde/BatchJobs_2014070738-files [DEBUG INFORMATION: BatchJobsFuture:; Expression:; {; mprintf("Permute across blocks (%d,1)-(%d,%d) ...\n", row,; row, nchrs); data_row <- listenv(); for (jj in seq_along(cols)) {; col <- cols[jj]; mprintf("Block (%d,%d) ...\n", row, col); seed_block <- seeds[[row, col]]; randomSeed("set", seed = seed_block, kind = "L'Ecuyer-CMRG"); seed_block_tag <- sprintf("seed_md5=%s", digest::digest(seed_block)); blockTag <- sprintf("block=%s_vs_%s", chrs[row], chrs[col]); ppTag <- sprintf("p=%d-%d", 1, P); fullname_row_col <- paste(c(blockTag, seed_block_tag,; ppTag), collapse = ","); filename_row_col <- sprintf("%s.rds", fullname_row_col); pathname_row_col <- file.path(pathD, filename_row_col); if (file_test("-f", pathname_row_col)) {; data_row_col <- readRDS(pathname_row_col); data_row[[col]] <- data_row_col; mprin
Calls: as.list ... value.BatchJobsFuture -> NextMethod -> value.Future

Suggestions

  1. Try to extract more information from BatchJobs jobs that fail, e.g. capture the content of the job's .out file.
  2. Add an option to not delete BatchJobs registry files/directories if there was an error. Possibly enable this option by default.
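Suggestion 1 could look something like the following minimal sketch in base R. It assumes that the registry directory has been kept (suggestion 2) and that each job's log is a file ending in `.out` somewhere below that directory. `tail_job_logs()` is a hypothetical helper, not part of future.BatchJobs or BatchJobs.

```r
# Hypothetical helper (not part of future.BatchJobs or BatchJobs):
# collect the last `n` lines of every job log (*.out) found below a
# kept BatchJobs registry directory.
tail_job_logs <- function(files_dir, n = 10L) {
  stopifnot(dir.exists(files_dir))
  # Assumption: job logs end in ".out"; search recursively because the
  # exact subdirectory layout inside the registry may vary
  logs <- list.files(files_dir, pattern = "[.]out$",
                     recursive = TRUE, full.names = TRUE)
  # Read each log and keep only its last `n` lines
  res <- lapply(logs, function(f) utils::tail(readLines(f, warn = FALSE), n))
  # Name each element by its log file so failing jobs are easy to spot
  names(res) <- basename(logs)
  res
}
```

Pointing it at the "Files dir" reported in a BatchJobs error message would return the tail of each job's output, which is often where the actual cause of an expired job shows up.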
HenrikBengtsson commented 8 years ago

Implemented. It works with the following toy example:

{hb}: Rscript -e "library(future.BatchJobs); plan(batchjobs); f %<-% { stop('woo') }; print(f)"
Loading required package: future
Loading required package: BatchJobs
Loading required package: BBmisc
Loading required package: methods
Error in Exception(...) :
  BatchJobError: 'Error in tryCatchList(expr, classes, parentenv, handlers) : woo ' [DEBUG INFORMATION: BatchJobsFuture:; Expression:; {; stop("woo"); }; Status: 'error', 'started', 'submitted'; Error: 'Error in tryCatchList(expr, classes, parentenv, handlers) : woo '; BatchJobs configuration:; Job registry:  BatchJobs_1159903939; Number of jobs:  1; Files dir: x:/future.BatchJobs/.future/20160429_162348-QJtsJh/BatchJobs_1159903939-files; Work dir: x:/future.BatchJobs; Multiple result files: FALSE; Seed: 910586702; Required packages: BatchJobs; Cluster functions: 'Local']
Calls: print ... value.BatchJobsFuture -> NextMethod -> value.Future
Execution halted
Warning message:
In delete.BatchJobsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove BatchJob registry, because the status of the BatchJobs was 'error' and option 'future.delete' is not set to FALSE: 'x:/future.BatchJobs/.future/20160429_162348-QJtsJh/BatchJobs_1159903939-files'
  Will not remove BatchJob registry, because the status of the BatchJobs was 'started' and option 'future.delete' is not set to FALSE: 'x:/future.BatchJobs/.future/20160429_162348-QJtsJh/BatchJobs_1159903939-files'
  Will not remove BatchJob registry, because the status of the BatchJobs was 'submitted' and option 'future.delete' is not set to FALSE: 'x:/future.BatchJobs/.future/20160429_162348-QJtsJh/BatchJobs_1159903939-files'

{hb}: dir "x:/future.BatchJobs/.future/20160429_162348-QJtsJh/BatchJobs_1159903939-files"
 Volume in drive X is Windows7_OS
 Volume Serial Number is E038-51CC

 Directory of x:\future.BatchJobs\.future\20160429_162348-QJtsJh\BatchJobs_1159903939-files

04/29/2016  04:23 PM    <DIR>          .
04/29/2016  04:23 PM    <DIR>          ..
04/29/2016  04:23 PM             5,120 BatchJobs.db
04/29/2016  04:23 PM               972 conf.RData
04/29/2016  04:23 PM    <DIR>          exports
04/29/2016  04:23 PM    <DIR>          functions
04/29/2016  04:23 PM    <DIR>          jobs
04/29/2016  04:23 PM    <DIR>          pending
04/29/2016  04:23 PM               613 registry.RData
04/29/2016  04:23 PM    <DIR>          resources
               3 File(s)          6,705 bytes
               7 Dir(s)   8,725,270,528 bytes free
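Because failed registries are now left on disk, they can accumulate under `.future/` across runs. As a minimal sketch, assuming the `.future/<timestamp>-<id>/BatchJobs_<number>-files` layout seen in the paths above (the numeric registry id is an assumption based on these examples), the hypothetical base-R helper `find_kept_registries()` lists them so they can be inspected, and removed once troubleshooting is done.

```r
# Hypothetical helper (not part of future.BatchJobs): list BatchJobs
# registry directories that were left behind under a ".future" directory.
find_kept_registries <- function(root = ".future") {
  # Nothing to report if the root directory does not exist
  if (!dir.exists(root)) return(character(0))
  dirs <- list.dirs(root, full.names = TRUE, recursive = TRUE)
  # Assumed naming pattern, based on the paths above: BatchJobs_<number>-files
  dirs[grepl("^BatchJobs_[0-9]+-files$", basename(dirs))]
}
```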