Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

BiocParallel fails to start with MPI #120

Open raffaelepotami opened 4 years ago

raffaelepotami commented 4 years ago

Hello everyone, we are having trouble running BiocParallel within our SLURM cluster environment.

The foo.R script we are trying to run is:

library("BiocParallel")
library("Rmpi")

param <- SnowParam(workers = 3, type = "MPI")
FUN <- function(i) system("hostname", intern=TRUE)
bplapply(1:6, FUN, BPPARAM = param)

If we request an interactive job allocation, for example with salloc -p mpi -N 2 -n 4 -t 1:00:00, start R with mpiexec -np 1 R --no-save, and then run the above script from that interactive shell, everything works as expected:

> library("BiocParallel")
library("BiocParallel")
> library("Rmpi")
library("Rmpi")
> param <- SnowParam(workers = 3, type = "MPI")
param <- SnowParam(workers = 3, type = "MPI")
> FUN <- function(i) system("hostname", intern=TRUE)
FUN <- function(i) system("hostname", intern=TRUE)
> bplapply(1:6, FUN, BPPARAM = param)
bplapply(1:6, FUN, BPPARAM = param)
    3 slaves are spawned successfully. 0 failed.
[[1]]
[1] "compute-a-16-21"

[[2]]
[1] "compute-a-16-21"

[[3]]
[1] "compute-a-16-22"

[[4]]
[1] "compute-a-16-22"

[[5]]
[1] "compute-a-16-22"

[[6]]
[1] "compute-a-16-22"

However, if we try to run the same R script from within an sbatch job with:

#!/bin/bash

#SBATCH -p mpi
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -t 2:00:00

mpiexec -np 1 Rscript foo.R  # or R CMD BATCH foo.R 

The execution hangs for several seconds and eventually fails with an MPI error:

[compute-a-16-21:10780] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Does anyone have any idea why the primary R process fails to start the other tasks?

Thank you Raffaele

raffaelepotami commented 4 years ago

Update: starting the batch job with

mpiexec -np 1 R --no-save --file=foo.R

instead of R CMD BATCH or Rscript seems to work. The execution still ends with an Open MPI error, since the task simply dies there at the end, but at least hostname does run across the distributed nodes.
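
As a next step we may try shutting the cluster down explicitly before R exits. This is a minimal sketch, assuming (we are not certain) that the bad exit comes from the MPI workers never being finalized:

library("BiocParallel")
library("Rmpi")

param <- SnowParam(workers = 3, type = "MPI")
FUN <- function(i) system("hostname", intern=TRUE)
res <- bplapply(1:6, FUN, BPPARAM = param)
print(res)

## Assumption: tearing the workers down explicitly lets Open MPI
## finalize cleanly instead of the task just dying at exit.
bpstop(param)   # stop the snow/MPI workers
mpi.quit()      # exit R through Rmpi so MPI is finalized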

nturaga commented 4 years ago

Can you try the BiocParallel::BatchToolsParam() interface on your SLURM cluster?
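
Something along these lines; a minimal sketch, where the template file name slurm.tmpl and the resources values are assumptions you would adapt to your site. BatchToolsParam() submits the workers through SLURM itself via batchtools, so no mpiexec wrapper is needed:

library("BiocParallel")

## "slurm.tmpl" is a hypothetical batchtools template file for your
## cluster; walltime/ncpus are placeholder resource requests.
param <- BatchToolsParam(
    workers   = 3,
    cluster   = "slurm",
    template  = "slurm.tmpl",
    resources = list(walltime = 3600, ncpus = 1)
)
FUN <- function(i) system("hostname", intern=TRUE)
bplapply(1:6, FUN, BPPARAM = param)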