HenrikBengtsson / CostelloPSCNSeq

R package: Parent-specific Copy-number Estimation Pipeline using HT-Seq Data

future.batchtools not connecting to Torque server #29

Closed ivan108 closed 3 years ago

ivan108 commented 3 years ago

I am getting an error when trying to run 1.mpileup.R, which is the first step of the pipeline.

module load CBC r/3.4.4
cd /home/jocostello/repositories/HenrikBengtsson/Costello-PSCN-Seq
Rscript 0.setup.R
qsub -l vmem=200gb -d "${PWD}" -M "${EMAIL}" -m ae 1.mpileup.pbs

Error: Listing of jobs failed (exit code 33);
cmd: 'qselect -u $USER -s EHRT'
output:
socket_connect_unix failed: 15137
qselect: cannot connect to server (null) (errno=15137) could not connect to trqauthd
Execution halted
Error : Listing of jobs failed (exit code 33);
cmd: 'qselect -u $USER -s EHRT'
output:
socket_connect_unix failed: 15137
qselect: cannot connect to server (null) (errno=15137) could not connect to trqauthd

It seems batchtools fails to connect to the Torque server. Any ideas on how to fix that?

Thanks!
Ivan

cc/ @SRHilz

HenrikBengtsson commented 3 years ago

FYI, I'm on this but I realize this package hasn't got much love during the last 12-15 months and is now a bit tricky to install and test. I'm working on some updates upstream, e.g. in aroma.seq. Stay tuned.

HenrikBengtsson commented 3 years ago

I can reproduce this with R 3.6.3 as:

$ module load CBC r
$ R --vanilla
R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
...
> library(future.batchtools)
Loading required package: future
> plan(batchtools_torque)
> f <- future(Sys.info())
Error: Listing of jobs failed (exit code 33);
cmd: 'qselect -u $USER -s EHRT'
output:
socket_connect_unix failed: 15137
qselect: cannot connect to server (null) (errno=15137) could not connect to trqauthd

The traceback is:

> traceback()
14: stop(simpleError(sprintf(...), call = NULL))
13: stopf("%s (exit code %i);\ncmd: '%s'\noutput:\n%s", msg, exit.code, 
        cmd, output)
12: OSError("Listing of jobs failed", res)
11: listJobs(reg, args)
10: cf$listJobsRunning(reg)
9: unique(cf$listJobsRunning(reg))
8: getBatchIds(reg, status = "all")
7: .findOnSystem(reg = reg, cols = c("job.id", "batch.id"))
6: batchtools::submitJobs(reg = reg, ids = jobid, resources = resources)
5: run.BatchtoolsFuture(future)
4: run(future)
3: batchtools_by_template(expr, envir = envir, substitute = FALSE, 
       globals = globals, label = label, template = template, type = "torque", 
       resources = resources, workers = workers, registry = registry, 
       ...)
2: makeFuture(expr, substitute = FALSE, envir = envir, lazy = lazy, 
       seed = seed, globals = globals, packages = packages, label = label, 
       gc = gc, ...)
1: future(Sys.info())

and session details are:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)

Matrix products: default
BLAS:   /home/shared/cbc/software_cbc/R/R-3.6.3/lib64/R/lib/libRblas.so
LAPACK: /home/shared/cbc/software_cbc/R/R-3.6.3/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] future.batchtools_0.10.0 future_1.21.0           

loaded via a namespace (and not attached):
 [1] parallelly_1.23.0 magrittr_2.0.1    hms_1.0.0         progress_1.2.2   
 [5] rappdirs_0.3.3    debugme_1.1.0     R6_2.5.0          brew_1.0-6       
 [9] rlang_0.4.10      globals_0.14.0    tools_3.6.3       parallel_3.6.3   
[13] checkmate_2.0.0   data.table_1.13.6 withr_2.4.1       ellipsis_0.3.1   
[17] base64url_1.4     digest_0.6.27     tibble_3.0.6      lifecycle_0.2.0  
[21] crayon_1.4.1      fs_1.5.0          vctrs_0.3.6       batchtools_0.9.15
[25] codetools_0.2-16  stringi_1.5.3     pillar_1.4.7      compiler_3.6.3   
[29] backports_1.2.1   prettyunits_1.1.1 listenv_0.8.0     pkgconfig_2.0.3

This problem is independent of this package. It could be related to future.batchtools, to batchtools, or simply to the TIPCC cluster itself.

HenrikBengtsson commented 3 years ago

Oh... this is completely unrelated to R. It looks like there's a problem with development node n6 and some of the other nodes:

[henrik@n6 ~]$ qselect -u $USER -s EHRT
socket_connect_unix failed: 15137
qselect: cannot connect to server (null) (errno=15137) could not connect to trqauthd

[henrik@n6 ~]$ qstat
socket_connect_unix failed: 15137
socket_connect_unix failed: 15137
socket_connect_unix failed: 15137
qstat: cannot connect to server (null) (errno=15137) could not connect to trqauthd
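
For anyone who wants to rule out the scheduler before digging into R, here is a minimal sketch of the same check wrapped in a shell function. The function name `check_torque` and its messages are hypothetical, not part of Torque or batchtools. The only real command it relies on is the `qselect -u $USER -s EHRT` call shown in the error above.

```shell
# check_torque: hypothetical helper (not part of Torque or batchtools).
# Runs the same job-listing query that batchtools issues and reports
# whether the Torque client tools can reach the server from this node.
check_torque() {
  # Bail out early if the Torque client tools are not available at all.
  if ! command -v qselect >/dev/null 2>&1; then
    echo "qselect not found on PATH" >&2
    return 127
  fi

  # Capture both stdout and stderr so the failure message can be shown.
  ct_output=$(qselect -u "${USER:-$(id -un)}" -s EHRT 2>&1)
  ct_status=$?

  if [ "$ct_status" -ne 0 ]; then
    echo "qselect failed (exit code $ct_status):" >&2
    echo "$ct_output" >&2
    return "$ct_status"
  fi

  echo "Torque server reachable"
}
```

If `check_torque` returns a non-zero exit code on a node, the problem most likely sits with the cluster setup on that node rather than with future.batchtools or batchtools.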

I'll move this to our TIPCC tracker. Closing here.

HenrikBengtsson commented 3 years ago

This should have been fixed now (https://github.com/UCSF-TI/TIPCC/issues/337). Verified on n6 using:

> library(future.batchtools)
Loading required package: future
> plan(batchtools_torque)
> f <- future(Sys.info())
> info <- value(f)
> str(as.list(info))
List of 8
 $ sysname       : chr "Linux"
 $ release       : chr "2.6.32-504.12.2.el6.664g0000.x86_64"
 $ version       : chr "#1 SMP Wed Mar 11 14:20:51 EDT 2015"
 $ nodename      : chr "n7"
 $ machine       : chr "x86_64"
 $ login         : chr "unknown"
 $ user          : chr "henrik"
 $ effective_user: chr "henrik"
>

Try again.