Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

BiocParallel does not keep logs during job failure #103

Open andr-kun opened 5 years ago

andr-kun commented 5 years ago

Hello,

I am currently running parallel jobs (with bplapply) on an LSF cluster using BatchtoolsParam, and I found an issue where no logs are produced for runs in which a few jobs fail.

Here is the error log from the failed run:

Quitting from lines 86-88 (demultiplex.Rmd) 
Error in .reduceResultsList(ids, fun, ..., missing.val = missing.val,  : 
  All jobs must be have been successfully computed
Calls: <Anonymous> ... bplapply -> bplapply -> <Anonymous> -> .reduceResultsList
Execution halted

When I tried to check the logs from the batchtools jobs, I noticed that no logs were being produced at all, which made it difficult to figure out why the jobs failed. I eventually managed to capture the logs manually by copying the temporary registry directory before the bplapply job finished. From those logs I found that the failures were caused by a missing executable on a few of the cluster nodes, which made the jobs exit before R was even executed.

It would be really useful to be able to get the logs from the batchtools jobs even if some of the jobs failed to execute R, especially on an LSF cluster, as the logs contain the job execution information.

nturaga commented 5 years ago

Hi @andr-kun

You can set saveregistry=TRUE to avoid deleting your logs.

> BiocParallel::BatchtoolsParam
function (workers = batchtoolsWorkers(cluster), cluster = batchtoolsCluster(),
    registryargs = batchtoolsRegistryargs(), saveregistry = FALSE,
    resources = list(), template = batchtoolsTemplate(cluster),
    stop.on.error = TRUE, progressbar = FALSE, RNGseed = NA_integer_,
    timeout = 30L * 24L * 60L * 60L, exportglobals = TRUE, log = FALSE,
    logdir = NA_character_, resultdir = NA_character_, jobname = "BPJOB")

saveregistry: 'logical(1)'
     Option given to store the entire registry for the job(s). This
     functionality should only be used when debugging. The storage of
     the entire registry can be time and space expensive on disk. The
     registry will be saved in the directory specified by 'file.dir' in
     'registryargs'; the default location is the current working
     directory. The saved registry directories will have suffix "-1",
     "-2" and so on, for each time the 'BatchtoolsParam' is used.

Note: Since saving the entire registry can be expensive, please submit a smaller job to debug if you have cluster limitations. Otherwise, you can inspect the logs of your entire job with this option.
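As a minimal sketch of the option described above, the helper below builds a BatchtoolsParam that keeps its registry and writes it to an explicit directory. The helper name `make_debug_param` and the `bp_registry` path are hypothetical and only for illustration; `cluster = "lsf"` assumes an LSF scheduler is available where the param is actually used.

```r
library(BiocParallel)

# Hypothetical helper (not part of BiocParallel): build a BatchtoolsParam
# that keeps the batchtools registry, and therefore the job logs, after
# the run. "bp_registry" is an illustrative path, not a package default.
make_debug_param <- function(workers = 2L,
                             cluster = "lsf",
                             registry_dir = file.path(getwd(), "bp_registry")) {
    BatchtoolsParam(
        workers = workers,
        cluster = cluster,
        saveregistry = TRUE,
        # 'file.dir' is passed through to the batchtools registry
        registryargs = batchtoolsRegistryargs(file.dir = registry_dir)
    )
}

# Usage, on a machine where the scheduler is available:
# param <- make_debug_param()
# res <- bplapply(1:4, sqrt, BPPARAM = param)
# list.files(getwd(), pattern = "^bp_registry")
```

According to the documentation quoted above, each use of the param should leave a registry directory with a "-1", "-2", ... suffix, so the directory to inspect would be named after `registry_dir` plus that suffix.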

andr-kun commented 5 years ago

Thanks @nturaga for the information regarding saveregistry=TRUE. This would definitely be useful for debugging a smaller job, as you mentioned.

The problem with my situation is that some nodes in the cluster can just fail without any warning - so most of the jobs will actually work and suddenly a few jobs will start failing due to being assigned to failed nodes, which stops the entire bplapply run. Given that the run can take hours to finish, I am now looking into bptry and BPREDO in order to try and recover from the failed jobs.
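For reference, the general bptry / BPREDO pattern looks roughly like the sketch below. It uses SerialParam and an environment variable (`SIMULATE_BAD_NODE`, a made-up name) to imitate a node that fails on the first attempt and works on the second, so it runs without a cluster. It only illustrates the recovery pattern and is not a claim about how BatchtoolsParam behaves when a job never starts R.

```r
library(BiocParallel)

# Illustration only: the environment variable stands in for a bad node.
task <- function(x) {
    if (x == 3 && Sys.getenv("SIMULATE_BAD_NODE") == "1")
        stop("simulated node failure")
    sqrt(x)
}

# Keep going after a failure so the other tasks still complete
param <- SerialParam(stop.on.error = FALSE)

# First attempt: one task fails, bptry() returns the partial results
Sys.setenv(SIMULATE_BAD_NODE = "1")
res <- bptry(bplapply(1:4, task, BPPARAM = param))
failed <- which(!bpok(res))   # positions of the failed tasks

# Second attempt: only the failed tasks are recomputed and merged back in
Sys.setenv(SIMULATE_BAD_NODE = "0")
res2 <- bplapply(1:4, task, BPREDO = res, BPPARAM = param)
```

The same `X` and `FUN` are passed on the redo, which matches the situation here, where the input is fine and only the execution environment was at fault.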

From the testing I have done with bptry and BPREDO, I noticed that BiocParallel still produces no logs in cases where a job failed to even start (bptry(bplapply(...)) returns the same error, All jobs must be have been successfully computed). It would be really helpful if BiocParallel could recover the logs in these cases, as they can be used for reporting issues to the cluster administrator and for blacklisting the nodes in future job runs. This is especially needed for LSF clusters, where the logs are only available from the log files themselves*, rather than from the cluster management software as in SLURM.

* There is a possibility of getting the logs from the cluster management software in LSF, but this is only kept for a short period of time.
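If a copy of the registry directory is available (saved with saveregistry=TRUE or copied manually as described earlier), something along these lines could pull out the problem jobs and whatever logs exist for them. This is a sketch using batchtools directly: `inspect_registry` is a hypothetical helper, and it assumes that jobs which died before R started show up as "expired" in batchtools and that the scheduler actually wrote a log file for them, which may not hold on every cluster.

```r
library(batchtools)

# Hypothetical helper: summarise failed jobs in a saved registry directory.
inspect_registry <- function(registry_dir) {
    # Open read-only so the saved registry is not modified
    reg <- loadRegistry(registry_dir, writeable = FALSE, make.default = FALSE)

    errored <- findErrors(reg = reg)$job.id    # jobs that raised an R error
    expired <- findExpired(reg = reg)$job.id   # submitted jobs that vanished

    # getLog() signals an error when no log file exists, so guard it
    read_log <- function(id) {
        tryCatch(
            getLog(id, reg = reg),
            error = function(e) paste("no log available:", conditionMessage(e))
        )
    }

    ids <- c(errored, expired)
    list(errored = errored,
         expired = expired,
         logs = setNames(lapply(ids, read_log), ids))
}

# Usage (the path is illustrative):
# info <- inspect_registry("bp_registry-1")
# info$expired
```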