amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

autoMRE hangs w/ MPI #74

Closed brantfaircloth closed 4 years ago

brantfaircloth commented 4 years ago

We've observed a problem in 0.9.0 where running autoMRE with MPI seems to hang once 50 bootstrap replicates have been generated. This is the point at which the code would typically evaluate the replicates for convergence and then continue if they have not converged.

An ancillary issue that arises with this problem is that neither the bootstrap replicates nor the best ML tree files are visible when the run dies. Is there a way to extract these from the *.ckp files?

We've been calling RAxML pretty simply so far (e.g. no hybrid mode), from the following PBS script:

#!/bin/bash
#PBS -q checkpt
#PBS -A hpc_allbirds03
#PBS -l walltime=72:00:00
#PBS -l nodes=13:ppn=20
#PBS -V
#PBS -N raxmlng-std-mpi
#PBS -o raxmlng-std-mpi.out
#PBS -e raxmlng-std-mpi.err

module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR
SEED=$RANDOM
echo $SEED

mpiexec -np 260 -machinefile $PBS_NODEFILE /project/brant/shared/bin/raxml-ng-mpi \
    --msa 75p_alignments.phylip.raxml.rba \
    --seed $SEED \
    --all \
    --bs-trees autoMRE

Thanks very much, b

brantfaircloth commented 4 years ago

Just to answer the ancillary part of my question about the "missing" files, in case anyone hits this issue and needs a workaround: I restarted the job from the *.ckp file and specified a fixed count of 50 bootstrap replicates instead of autoMRE (see below). This created the desired files.

#!/bin/bash
#PBS -q checkpt
#PBS -A hpc_allbirds03
#PBS -l walltime=72:00:00
#PBS -l nodes=13:ppn=20
#PBS -V
#PBS -N raxmlng-std-mpi
#PBS -o raxmlng-std-mpi.out
#PBS -e raxmlng-std-mpi.err

module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR
SEED=$RANDOM
echo $SEED

mpiexec -np 260 -machinefile $PBS_NODEFILE /project/brant/shared/bin/raxml-ng-mpi \
    --msa 75p_alignments.phylip.raxml.rba \
    --seed $SEED \
    --all \
    --bs-trees 50
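
For anyone applying this workaround, a quick way to confirm that the "missing" files were actually written is a small check like the one below. This is only a sketch: the function name `check_raxml_outputs` is made up for illustration, and it assumes that outputs are named `<prefix>.raxml.bestTree`, `<prefix>.raxml.bootstraps` and `<prefix>.raxml.ckp`, with the prefix defaulting to the `--msa` file name when no `--prefix` is given. Adjust the names if your run differs.

```shell
#!/bin/bash
# Hypothetical helper: report which expected output files exist for a run prefix.
# Assumption: outputs are named <prefix>.raxml.bestTree, <prefix>.raxml.bootstraps
# and <prefix>.raxml.ckp; none of these names come from the thread itself.
check_raxml_outputs() {
    local prefix="$1"
    local missing=0
    local suffix
    for suffix in bestTree bootstraps; do
        # -s: the file exists and is not empty
        if [ -s "${prefix}.raxml.${suffix}" ]; then
            echo "found:   ${prefix}.raxml.${suffix}"
        else
            echo "missing: ${prefix}.raxml.${suffix}"
            missing=1
        fi
    done
    # If outputs are missing but a checkpoint exists, a restart may recover them.
    if [ "$missing" -eq 1 ] && [ -e "${prefix}.raxml.ckp" ]; then
        echo "checkpoint present: ${prefix}.raxml.ckp"
    fi
    # Exit status 0 only when both output files were found.
    return "$missing"
}
```

With the default prefix assumed above, it would be called as `check_raxml_outputs 75p_alignments.phylip.raxml.rba` from the working directory after the job ends. A non-zero exit status together with the "checkpoint present" line suggests the run stopped before writing its final files, which is the situation the restart with `--bs-trees 50` addresses.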

amkozlov commented 4 years ago

Thanks for reporting! This should now be fixed by the latest commit (https://github.com/amkozlov/raxml-ng/commit/66ad9d22333263f6d1d3643328bd3bed8721dc6a). Could you please confirm?

amkozlov commented 4 years ago

Closing, please feel free to reopen if the problem persists.