flass / pantagruel

a pipeline for reconciliation of phylogenetic histories within a bacterial pangenome
GNU General Public License v3.0

Task 6: HPC scripts #36

Closed: megaptera-helvetiae closed this issue 4 years ago

megaptera-helvetiae commented 4 years ago

Hi Florent!

Thanks for implementing and explaining the new HPC scripts here: https://github.com/flass/pantagruel#hpc-scripts-submission-of-intensive-tasks-to-high-performance-computer-clusters

I am running my whole pantagruel pipeline on an HPC cluster, so I am a bit confused. Can I start directly by running the following scripts:


- pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh
- pantagruel_pipeline_06.2_HPC_collapse_gene_trees.sh
- pantagruel_pipeline_06.3_HPC_bayesian_gene_trees.sh
- pantagruel_pipeline_06.4_HPC_replace_spe_by_pops.sh
- pantagruel_pipeline_06.5_HPC_populate_db_collapsed_clades.sh

Or do I have to run these two commands beforehand:


pantagruel -i previous_config_file --refresh --submit_hpc hpchost:/where/you/will/set/your/database init
pantagruel -i previous_config_file 06

I am asking because the second of these two seems to take forever and I wonder whether pantagruel is trying to run the whole task 6 instead of only identifying the gene families.

What 'database' are you referring to in the manual?

Thank you!

megaptera-helvetiae commented 4 years ago

[screenshot: task6_HPC]

I mean, can I kill the job here and use /scratch/clamchatka/Panta/test9/06.gene_trees/cdsfams_minsize4 as the database for the 5 HPC scripts?

megaptera-helvetiae commented 4 years ago

Also, you might want to pay attention to this:

[wilkins@gorilla test9]$ /scratch/clamchatka/Panta/pantagruel/pantagruel -i environ_pantagruel_test9.sh --refresh --submit_hpc hpchost:/scratch/clamchatka/Panta/test9 init
This is Pantagruel pipeline version 4867c048788ba7ec92dfd5ae9148d0349411151c using source code from repository '/scratch/clamchatka/Panta/pantagruel'
set address of database root folder on remote HPC cluster server
# will run tasks: init
Will refresh the database settings from the configuration file '/scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh'
re-run previous command line:
(appending the new arguments: submit_hpc hpchost:/scratch/clamchatka/Panta/test9  init)
# pantagruel -d test9 -r /scratch/clamchatka/Panta/ -a /scratch/clamchatka/Panta/user_genomes/ -T /scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/ submit_hpc hpchost:/scratch/clamchatka/Panta/test9  init
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_master.sh: line 649: pantagruel: command not found
flass commented 4 years ago

Hi Laetitia, to your 1st and 2nd messages: yes, you can just carry on with the task 06.1 HPC script once the gene family list file has been created. The script is probably stuck in the following lines, trying to copy the files from the host to itself; I should add a clause for a clean exit when that is the case (i.e. when you ran the previous tasks on the same host as the HPC submission server). Unless, as you suggest, it just went down the normal route of the pipeline and started computing the raxml trees. In any case, yes, you should stop the job now. Then, can you please report here the output of a ls -ltrh 06.gene_trees/ command so I can see where the pipeline was routed? After that you can proceed with running the pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh script. If some raxml gene trees were already computed, you can add the option --resume to keep those and only compute the rest.

By "database" I mean your pantagruel dataset confined to the top folder (in your case 'test9').

About the last comment: yes something seems not right there; I'll have a look (between family dinners... don't expect it to come too fast!).

Florent

megaptera-helvetiae commented 4 years ago
[wilkins@gorilla test9]$ ls -ltrh 06.gene_trees/
total 497K
drwxr-sr-x. 2 wilkins login    0 Dec 20 22:31 raxml_trees
-rw-r--r--. 1 wilkins login  65K Dec 20 22:31 cdsfams_minsize4
-rw-r--r--. 1 wilkins login 431K Dec 20 22:32 cdsalifastacode_aln_list
drwxr-sr-x. 3 wilkins login    1 Dec 20 23:12 fullgenetree_mrbayes_trees
drwxr-sr-x. 3 wilkins login    3 Dec 20 23:12 fullgenetree_cdsfam_alignments_species_code
megaptera-helvetiae commented 4 years ago

And here a first trial to run the HPC script:

[wilkins@gorilla Panta]$ /scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh --mem 32 --ncpus 4 --wth 24 --hpctype 'LSF'
will request 32 GB for submitted jobs
will run jobs using 4 parallel threads
will request 24 hours walltime for submitted jobs
will use job submission syntax of HPC system: 'LSF'
Error: missing mandatory parameter: pantagruel config file

Usage:
 For running the script: with [option (default)]
    pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh [--ncpus (auto-configure)] [--mem (auto-configure)] [--wth (auto-configure)] [--hpctype ('PBS')] [--chunksize (1000)] ptg_config_file
 or for this help message:
    pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh -h

    Option details:
     --ncpus:   number of parallel threads (default: auto-configure).
     --mem: max memory allowance, in gigabytes (GB) (default: auto-configure).
     --wth: max walltime, in hours (default: auto-configure).
     --hpctype: 'PBS' for Torque scheduling system, or 'LSF' for IBM scheduling system (default: 'PBS').
     --chunksize:   number of families to be processed in one submitted job (default: 1000)
     --fwdenv:  names of current environment variables you want to be passed down to the submitted jobs
             if several, separate by ',' and enclose in quotes, e.g.: 'VAR1, VAR2' 
     --parallelflags:   additional flags to be specified for worker nodes' resource requests
             for instance, specifying parallelflags="mpiprocs=1:ompthreads=8" (with hpctype='PBS')
             will force the use of OpenMP multi-threading instead of default MPI parallelism.
             Note that parameter declaration syntax is not standard and differs depending on your HPC system;
             also the relevant parameter flags may be different depending on the HPC system config.
             Get in touch with your system admins to know the relevant flags and syntax.

 Note that the python scripts underlying this task require some non-standard packages.
 It is advised to use Anaconda to set up a stable environment to run these scripts
 across the worker nodes of your HPC system. The job-submitted wrapper script will first attempt to load a conda environment
 named 'ptgenv', which can be created (just once) using:
    ```conda create -n ptgenv python=2.7 pip
    conda activate ptgenv
    pip install scipy numpy biopython bcbio-gff Cython igraph psycopg2```

 If not found, the script will attempt to load Anaconda and associated python modules with the `module load anaconda3/personal` command (not standard, and likely not available on your cluster).
flass commented 4 years ago

oops my bad!

I forgot to state in the doc that one needs to indicate the pantagruel database configuration file as the last positional argument, even though it is mandatory! (corrected in 7b24b1b)

this was however made clear in the help/usage message above! the correct syntax is thus:

pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh [--ncpus int] [--mem int] [--wth int] [--hpctype {'PBS'|'LSF'}] [--chunksize int ] ptg_config_file

Given what your ls -ltrh 06.gene_trees/ command returned, it indeed seems that no raxml tree was computed, and thus that nothing happened beyond generating the list of gene families. You can thus go for a clean start with task 06.1.
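
With your paths, the full call would thus look something like this (a sketch; adjust the resource options to your needs):

/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh --mem 32 --ncpus 4 --wth 24 --hpctype 'LSF' /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh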

flass commented 4 years ago

About the issue reported in the 3rd comment (the 'command not found' error at line 649 of the master script): I think this is something I already fixed with commit dfa1115. Please update the scripts and let me know if it persists.
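
A minimal way to update, assuming your copy is the git clone reported in the version line earlier:

cd /scratch/clamchatka/Panta/pantagruel
git pull   # brings in dfa1115 (and the 7b24b1b doc fix)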

I hope that this is enough fixes so that Santa brings you lots of gene trees!

Best,

Florent

megaptera-helvetiae commented 4 years ago

Bonne année!!!

Here we are on the other side.

I have a qsub/bsub problem: our HPC uses Slurm. Also, I think I would need administrator rights to use those commands.

Is there a workaround?

See my errors here:

/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh --mem 32 --ncpus 4 --wth 24 --hpctype 'PBS' /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh --resume

will request 32 GB for submitted jobs
will run jobs using 4 parallel threads
will request 24 hours walltime for submitted jobs
will use job submission syntax of HPC system: 'PBS'
will resume computation of task where it was last stopped
will re-use previously established lists of gene families for which to compute trees
cdsfam2phylo=/scratch/clamchatka/Panta/test9/06.gene_trees/cdsfams_minsize4 ; mem_per_core=32gb ; walltime=24:00:00 ; ncpus=4
Resume task 6: 3968 ML trees left to infer
jobrange=1-1000
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 92: qsub: command not found
jobrange=1001-2000
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 92: qsub: command not found
jobrange=2001-3000
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 92: qsub: command not found
jobrange=3001-3968
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 92: qsub: command not found

Or here:

/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh --mem 32 --ncpus 4 --wth 24 --hpctype 'LSF' /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh --resume

will request 32 GB for submitted jobs
will run jobs using 4 parallel threads
will request 24 hours walltime for submitted jobs
will use job submission syntax of HPC system: 'LSF'
will resume computation of task where it was last stopped
will re-use previously established lists of gene families for which to compute trees
cdsfam2phylo=/scratch/clamchatka/Panta/test9/06.gene_trees/cdsfams_minsize4 ; mem_per_core=32gb ; walltime=24:00:00 ; ncpus=4
Resume task 6: 3968 ML trees left to infer
jobrange=1-1000
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 105: bsub: command not found
jobrange=1001-2000
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 105: bsub: command not found
jobrange=2001-3000
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 105: bsub: command not found
jobrange=3001-3968
/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh: line 105: bsub: command not found
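
A quick check of which submission command actually exists on our host, as a minimal sketch:

for cmd in qsub bsub sbatch ; do
  command -v ${cmd} >/dev/null 2>&1 && echo "${cmd}: $(command -v ${cmd})" || echo "${cmd}: not found"
done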
flass commented 4 years ago

Hi Laetitia, and happy new year too!

Yes, these scripts are not going to work if your cluster scheduling system is neither PBS/Torque nor LSF (IBM)... so I guess they would need to be adapted! If you want to get your hands dirty, I suppose it would not be too much work: just a matter of replacing the scheduler-specific commands in the submission ('HPC') scripts. I'm not familiar with Slurm, but I gather it would involve trading the qsub or bsub command for sbatch with the corresponding options, and likewise replacing the scheduler-specific variables $PBS_SOMETHING or $LSB_SOMETHING in the batch scripts called by the HPC scripts (those ending in .qsub or .bsub).
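
Something like this, I imagine; a rough, untested sketch of the mapping, reusing the option values from the qsub call in the script:

# PBS/Torque array submission, as currently in the HPC script:
#   qsub -J ${jobrange} -l walltime=${wth}:00:00 -l select=1:ncpus=${ncpus}:mem=${mem}gb \
#        -N jobname -o logdir -j oe -v "${qsubvars}" script.qsub
# hypothetical Slurm equivalent:
sbatch --array=${jobrange} --time=${wth}:00:00 --nodes=1 --ntasks=${ncpus} --mem=${mem}G \
       --job-name=jobname --output=logdir/%x.%A.%a.o --export="${qsubvars}" script.qsub
# note: Slurm's --export without 'ALL' passes only the listed variables to the job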

Sorry I cannot provide better at the moment :-/

Best,

Florent

megaptera-helvetiae commented 4 years ago

Thanks Florent.

I am taking up the challenge: I will try to adapt the script on Friday, and post it here once it runs.

:)

flass commented 4 years ago

Good luck and yes please let me (and other users!) know!

megaptera-helvetiae commented 4 years ago

OK. Here we go. I am exhausted.

I started with the following script: pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh. After my changes, it looks like this (pantagruel_pipeline_06.1_HPC_full_ML_gene_trees_SLURM.sh):

#!/bin/bash

#########################################################
## PANTAGRUEL:                                         ##
##             a pipeline for                          ##
##             phylogenetic reconciliation             ##
##             of a bacterial pangenome                ##
#########################################################

# Copyright: Florent Lassalle (f.lassalle@imperial.ac.uk), 15 Jan 2019

# environment variables to be passed on to interface script:
export hpcscript=$(basename ${0})
hpcscriptdir=$(dirname ${0})
export defncpus='auto-configure'
export defmem='auto-configure'
export defwth='auto-configure'
# export defhpctype='PBS'
# Here I commented out the export defhpctype='PBS'!
export defchunksize=1000
export withpython='true'

source ${hpcscriptdir}/pantagruel_pipeline_HPC_common_interface.sh "${@}"

#############################################################
## 06.1 Full ML gene trees on HPC
#############################################################

mkdir -p ${ptglogs}/raxml/gene_trees/ ${mlgenetrees}/

## compute first pass of gene trees with RAxML, using rapid bootstrap to estimate branch supports
if [[ "${resumetask}" == "true" && ! -z "$(ls ${genetrees}/*cdsfams_* 2> /dev/null | grep -v 'aln_list')" ]] ; then
  echo "will re-use previously established lists of gene families for which to compute trees" 
  allcdsfam2phylo=($(ls ${genetrees}/*cdsfams_* | grep -v 'aln_list'))
else
  rm -f ${genetrees}/cdsfams_*_aln_list
  sortminsizes=($(for famset in `ls ${genetrees}/*cdsfams_*` ; do echo $famset | sed -e 's/.*cdsfams_minsize\([0-9]\+.*\)/\1/' ; done | sort -n))
  allcdsfam2phylo=($(for famminsize in ${sortminsizes[@]} ; do ls ${genetrees}/*${famminsize} ; done))
fi

allmems=(4 8 32 64)
allwalltimes=(12 24 48 72)
allncpus=(4 4 4 16)

for i in ${!allcdsfam2phylo[@]} ; do
  # iterate over lists of family classed by sizes and find matching setting
  [ ${i} -gt 3 ] && i=3 # cap the index at 3 (the largest resource setting)
  cdsfam2phylo=${allcdsfam2phylo[$i]}
  # if their value was left to their default 'auto-configure', these variables are defined based on the size of the gene family
  [ "${mem}" == 'auto-configure' ] && mem=${allmems[$i]}
  [ "${wth}" == 'auto-configure'  ] && wth=${allwalltimes[$i]}
  [ "${ncpus}" == 'auto-configure'  ] && ncpus=${allncpus[$i]}
  echo "cdsfam2phylo=${cdsfam2phylo} ; mem_per_core=${mem}gb ; walltime=${wth}:00:00 ; ncpus=${ncpus}"
  tasklist=${cdsfam2phylo}_aln_list
  rm -f $tasklist ; for fam in `cut -f1 $cdsfam2phylo` ; do ls ${cdsalifastacodedir}/${fam}.codes.aln >> $tasklist ; done
  if [ "$(wc -l $cdsfam2phylo | cut -f1 -d' ')" -lt "$(wc -l $tasklist | cut -f1 -d' ')" ] ; then 
    >&2 echo "ERROR $(dateprompt): missing gene family alignments; please fix the list '$tasklist' or the content of folder '$alifastacodedir/' before continuing."
    exit 1
  fi

  rm -f ${tasklist}*
  for fam in $(cut -f1 ${cdsfam2phylo}) ; do
    aln=${cdsalifastacodedir}/${fam}.codes.aln
    if [[ "${resumetask}" == "true" ]] ; then
      if [[ ! -s ${mlgenetrees}/${mainresulttag}/RAxML_${mainresulttag}.${fam}.codes ]] ; then
        ls ${aln} >> ${tasklist}_resume
      fi
    fi
    ls ${aln} >> ${tasklist}
  done
  if [[ "${resumetask}" == "true" ]] ; then
    if [[ -s ${tasklist}_resume ]] ; then
      echo "Resume task 6: $(wc -l ${tasklist}_resume | cut -d' ' -f1) ML trees left to infer"
    else
      echo "Resume task 6: all ML tree built; skip ML tree building"
    fi
    tasklist=${tasklist}_resume
  fi

  qsubvars="tasklist=${tasklist},outputdir=${mlgenetrees},reducedaln=true,nbthreads=${ncpus}"
  if [ ! -z "${raxmlbin}" ] ; then
    qsubvars="${qsubvars},raxmlbin=${raxmlbin}"
  fi
  Njob=`wc -l ${tasklist} | cut -f1 -d' '`
  [ ! -z ${topindex} ] &&  [ ${Njob} -gt ${topindex} ] && Njob=${topindex}

  # accomodate with possible upper limit on number of tasks in an array job
  jobranges=($(${ptgscripts}/get_jobranges.py $chunksize $Njob))
  for jobrange in ${jobranges[@]} ; do
    #echo "jobrange=$jobrange"
          sbatch -a $jobrange -N 1 -n ${ncpus} --mem=${mem}G -J raxml_gene_trees_$(basename $cdsfam2phylo) -e ${ptglogs}/raxml/gene_trees -o ${ptglogs}/raxml/gene_trees --export="$qsubvars" ${ptgscripts}/raxml_array_PBS.qsub
# Here I commented out everything below and replaced it with the Slurm command just above (line 91 of my script)!

    #case "$hpctype" in
      #'PBS') 
       # qsub -J $jobrange -l walltime=${wth}:00:00 -l select=1:ncpus=${ncpus}:mem=${mem}gb -N raxml_gene_trees_$(basename $cdsfam2phylo) \
       # -o ${ptglogs}/raxml/gene_trees -j oe -v "$qsubvars" ${ptgscripts}/raxml_array_PBS.qsub

       # ;;
     # 'LSF')
       # if [ ${wth} -le 12 ] ; then
         # bqueue='normal'
        #elif [ ${wth} -le 48 ] ; then
         # bqueue='long'
        #else
        #  bqueue='basement'
       # fi
       # memmb=$((${mem} * 1024)) 
        #nflog="${ptglogs}/raxml/gene_trees/raxml_gene_trees_$(basename $cdsfam2phylo).%J.%I.o"
        #bsub -J "raxml_gene_trees_$(basename $cdsfam2phylo)[$jobrange]" -q ${bqueue} -R "select[mem>${memmb}] rusage[mem=${memmb}] span[hosts=1]" \
        #-n${ncpus} -M${memmb} -o ${nflog} -e ${nflog} -env "$qsubvars" ${ptgscripts}/raxml_array_LSF.bsub
        #;;
      #*)
       # echo "Error: high-performance computer system '$hpctype' is not supported; exit now"
        #exit 1;;
    #esac
  done
done

Briefly, I commented out the whole PBS/LSF case block and added one line with the following Slurm command:

sbatch -a $jobrange -N 1 -n ${ncpus} --mem=${mem}G -J raxml_gene_trees_$(basename $cdsfam2phylo) -e ${ptglogs}/raxml/gene_trees -o ${ptglogs}/raxml/gene_trees --export="$qsubvars" ${ptgscripts}/raxml_array_PBS.qsub

If you just echo the Slurm command in the new pantagruel_pipeline_06.1_HPC_full_ML_gene_trees_SLURM.sh script, you get this command:

sbatch -a 1-1000 -N 1 -n 4 --mem=32G -J raxml_gene_trees_cdsfams_minsize4 -e /scratch/clamchatka/Panta/test9/logs/raxml/gene_trees/test.error -o /scratch/clamchatka/Panta/test9/logs/raxml/gene_trees/test --export="tasklist=/scratch/clamchatka/Panta/test9/06.gene_trees/cdsfams_minsize4_aln_list,outputdir=/scratch/clamchatka/Panta/test9/06.gene_trees/raxml_trees,reducedaln=true,nbthreads=4" /scratch/clamchatka/Panta/pantagruel/scripts/raxml_array_PBS.qsub

You also have to change the following script: raxml_array_PBS.qsub. At line 61 change it to:

nfaln=`awk "NR==$SLURM_ARRAY_TASK_ID" $tasklist`
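
That array-index variable is the main scheduler-specific bit inside the job script; for reference, the standard names on the other schedulers are:

# equivalent array-index variables on the other schedulers:
#   PBS Pro: $PBS_ARRAY_INDEX   (older Torque: $PBS_ARRAYID)
#   LSF:     $LSB_JOBINDEX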

The whole script now looks like this:

#!/bin/bash
#PBS -S /bin/bash
#PBS -N raxml
#PBS -o /work/flassall/logs/mrbayes
#PBS -j oe 
#PBS -l walltime=24:00:00 
#PBS -l select=1:ncpus=4:mem=16gb

# load potential modules
if [ ! -z "${modulefile}" ] ; then
  source ${modulefile}
fi
# environment of Imperial College HPC cluster:
#module -v load raxml
#module -v load intel-suite
#module list

# parse external variables passed through qsub -v "var1=toto,var2=tutu,..."
# mandatory variables
if [ -z $tasklist ] ; then
  echo "!!! ERROR : must provide a task list through 'tasklist' variable ; exit now"
  exit 1
fi
if [ -z $outputdir ] ; then
  echo "!!! ERROR : must provide a output directory path through 'outputdir' variable ; exit now"
  exit 1
fi
# options
if [ -z $model ] ; then
  model='GTRCAT'
fi
if [ -z $mainresulttag ] ; then
  #~ mainresulttag='bipartitions'
  mainresulttags=('bipartitions' 'rootedTree' 'identical_sequences')
else
  mainresulttags=($mainresulttag)
fi
if [ -z $bootstrapalgo ] ; then
  bootstrapalgo='x'
fi
if [ -z $nbthreads ] ; then
  nbthreads=4
fi
if [ -z $reducedaln ] ; then
  reducedaln=false
fi

for mainresulttag in bulk ${mainresulttags[@]} ; do
  mkdir -p $outputdir/$mainresulttag/
  if [ ! -d $outputdir/$mainresulttag ] ; then 
    echo "!!! ERROR : unable to access output directory 'outputdir/$mainresulttag/' ; exit now"
    exit 1
  fi
done
if [ ! -e $tasklist ] ; then 
  echo "!!! ERROR : unable to access task list file '$tasklist' ; exit now"
  exit 1
fi
#`echo $SLURM_ARRAY_TASK_ID `
nfaln=`awk "NR==$SLURM_ARRAY_TASK_ID" $tasklist`
nfrad1=$(basename $nfaln)
nfext=${nfrad1##*.}
nfrad2=${nfrad1%.*}
echo $nfrad2

[ -x "${TMPDIR}" ] && cd ${TMPDIR} || cd /tmp/
echo "current directory is ${PWD}"

rsync -az $nfaln $TMPDIR/
if [ $? != 0 ] ; then
  echo "!!! ERROR : unable to copied input file $nfaln into $TMPDIR/ ; exit now"
  exit 1
else
  echo "copied input files $nfaln succesfully"
fi
echo "ls ./"
ls ./

# convert file format
if [ "$nfext" == 'nex' ] ; then
  module load python
  python2.7 -c "from Bio import AlignIO ; AlignIO.convert('$nfrad1', 'nexus', '${nfrad2}.fasta', 'fasta')"
  if [ $? == 0 ] ; then
    echo "succesfully converted Nexus input file $nfrad1 into FASTA format: ${nfrad2}.fasta"
    localn=${nfrad2}.fasta
    rsync -az $localn $outputdir/bulk/
  else
    echo "failed conversion of input file $nfrad1 into FASTA format; we'll see if it goes through..."
  fi
else
  localn=${nfrad1}
fi

# test presence of SSE3/AVX/AVX2 instruction support
if [[ ! -z "$(grep -o avx2 /proc/cpuinfo | head -n 1)" && ! -z "$(which raxmlHPC-PTHREADS-AVX2)" ]] ; then
  raxmlflav='-AVX2 -U'
elif [[ ! -z "$(grep -o avx /proc/cpuinfo | head -n 1)" && ! -z "$(which raxmlHPC-PTHREADS-AVX)" ]] ; then
  raxmlflav='-AVX -U'
elif [[ ! -z "$(grep -o sse3 /proc/cpuinfo | head -n 1)" && ! -z "$(which raxmlHPC-PTHREADS-SSE3)" ]] ; then
  raxmlflav='-SSE3 -U'
else
  raxmlflav=''
fi

raxmlbin="raxmlHPC-PTHREADS${raxmlflav} -T $nbthreads"

raxmloptions="-n $nfrad2 -m $model -p 1753"

idseqgreppat='exactly identical$'
idseqsedpat='s/IMPORTANT WARNING: Sequences \(.\+\) and \(.\+\) are exactly identical/\1\t\2/g'
if [[ "$reducedaln" == "true" ]] ; then
  ## reduce the alignement and record which sequences were seen as duplicates
  raxmlcall0="$raxmlbin -s $localn ${raxmloptions} -f c && grep '$idseqgreppat' RAxML_info.${nfrad2} | sed -e '$idseqsedpat' > RAxML_identical_sequences.${nfrad2}"
else
  raxmlcall0="# NOT reducing alignment to unique sequences before start"
  raxmlbin="$raxmlbin --silent"
fi
## search for global ML tree
raxmlcall1="$raxmlbin -s $localn $raxmloptions"
# search for {rapid|parametric} bootstrap trees
raxmlcall2="$raxmlbin -s $localn $raxmloptions -${bootstrapalgo} 987987 -N 100"
## map bootstraps on ML tree
raxmlcall3="$raxmlbin -s $localn $raxmloptions -f b -z RAxML_bootstrap.$nfrad2 -t RAxML_bestTree.$nfrad2"
## root ML tree
raxmlcall4="$raxmlbin -s $localn $raxmloptions -f I -t RAxML_bipartitionsBranchLabels.$nfrad2"
## generic end
raxmlcallz=""

### pipeline
raxmlcalls=(raxmlcall0 raxmlcall1 raxmlcall2 raxmlcall3 raxmlcall4 raxmlcallz)

status=$?
for i in {0..5} ; do
  let j=$i-1
  if [ $status != 0 ] ; then
    echo "!!! ERROR : during former RAxML call ; exit now"
    exit 1
  else  
    if [ -e "RAxML_info.$nfrad2" ] ; then 
      mv -f RAxML_info.$nfrad2 RAxML_info.$j.$nfrad2 
    fi
    eval raxmlcall='$'${raxmlcalls[$i]}
    echo ""
    echo "#####"
    echo $raxmlcall
    eval $raxmlcall
    status=$?
    if [ $i -eq 0 ] ; then
      ## change alignment to be further considered by RAxML as the reduced one: !!! the resulting tree will NOT contain any duplicates !!!
      if [[ -e ${localn}.reduced ]] ; then
        nbnrseq=$(head -n1 ${localn}.reduced | cut -d' ' -f1)
        if [[ $nbnrseq -lt 4 ]] ; then
          echo "WARNING: Reduced alignment is too small to pass further steps ; copy the identical_sequences file to '$outputdir/identical_sequences/' and quit"
          rsync -avz RAxML_identical_sequences.${nfrad2} ${outputdir}/identical_sequences/
          echo ${nfrad2} > ${outputdir}/bulk/${nfrad2}.smallreducedali
          exit 0
        else
          echo "# Found $(wc -l RAxML_identical_sequences.${nfrad2} | cut -d':' -f2) redundant sequences; replace input alignment '${localn}' by '${localn}.reduced'."
          repllocaln="mv ${localn} ${localn}.full ; mv ${localn}.reduced ${localn}"
          echo $repllocaln
          eval $repllocaln
        fi
      fi
    fi
  fi
done
echo "output of RAxML phylogenetic reconstruction is :"
echo "ls ./*.$nfrad2*"
ls ./*.$nfrad2*

rsync -avz --exclude=$nfrad1 --exclude=RAxML_info* ./*$nfrad2 $outputdir/bulk/
if [ $? != 0 ] ; then
  echo "!!! ERROR : unable to copy RAxML_* output files from $HOSTNAME:$PWD ; exit now"
  exit 1
else
  echo "copied RAxML_* output files with exit status $?"
fi
echo ""
for mainresulttag in ${mainresulttags[@]} ; do
  echo "mv -f $outputdir/bulk/RAxML_${mainresulttag}.$nfrad2 $outputdir/$mainresulttag/"
  mv -f $outputdir/bulk/RAxML_${mainresulttag}.$nfrad2 $outputdir/$mainresulttag/
done

Now test it with:

sbatch -a 3001-3005 -N 1 -n 4 --mem=32G -J raxml_gene_trees_cdsfams_minsize4 -e /scratch/clamchatka/Panta/test9/logs/raxml/gene_trees/error2.err -o /scratch/clamchatka/Panta/test9/logs/raxml/gene_trees/test2 --export="tasklist=/scratch/clamchatka/Panta/test9/06.gene_trees/cdsfams_minsize4_aln_list,outputdir=/scratch/clamchatka/Panta/test9/06.gene_trees/raxml_trees,reducedaln=true,nbthreads=4" /scratch/clamchatka/Panta/pantagruel/scripts/raxml_array_PBS.qsub

--> Make sure you first call up raxml with module load raxml/

To run the whole thing you would run:

/scratch/clamchatka/Panta/pantagruel/scripts/pipeline/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees_SLURM.sh --mem 32 --ncpus 4 --wth 24 /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh
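
Once submitted, the array jobs can be monitored with the usual Slurm tools, for example:

squeue -u ${USER}    # pending/running array tasks
# after completion, check exit states (job ID taken from the sbatch output):
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS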
megaptera-helvetiae commented 4 years ago

This all worked pretty nicely. However, here comes trouble.

Here is the full error I get when running the previous solution:

[wilkins@gorilla Panta]$ cat /scratch/clamchatka/Panta/test9/logs/raxml/gene_trees/error77.2
/var/spool/slurm/slurmd/job6678871/slurm_script: line 59: 3005: command not found
/var/spool/slurm/slurmd/job6678875/slurm_script: line 59: 3004: command not found
which: no raxmlHPC-PTHREADS-AVX2 in (/apps/python3/3.7.0/bin/:/apps/python2/2.7.15/bin:/apps/lua/5.3.4/bin:/apps/java/11.0.4/bin:/apps/gcc/9.1.0/bin:/apps/R/3.6.1/bin/:/apps/perl/5.28.0/bin/:/apps/cube/default/bin:/apps/slurm/19.05.3-2-lua/bin:/apps/module/4.3.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/user/wilkins/bin)
which: no raxmlHPC-PTHREADS-AVX in (/apps/python3/3.7.0/bin/:/apps/python2/2.7.15/bin:/apps/lua/5.3.4/bin:/apps/java/11.0.4/bin:/apps/gcc/9.1.0/bin:/apps/R/3.6.1/bin/:/apps/perl/5.28.0/bin/:/apps/cube/default/bin:/apps/slurm/19.05.3-2-lua/bin:/apps/module/4.3.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/user/wilkins/bin)
which: no raxmlHPC-PTHREADS-SSE3 in (/apps/python3/3.7.0/bin/:/apps/python2/2.7.15/bin:/apps/lua/5.3.4/bin:/apps/java/11.0.4/bin:/apps/gcc/9.1.0/bin:/apps/R/3.6.1/bin/:/apps/perl/5.28.0/bin/:/apps/cube/default/bin:/apps/slurm/19.05.3-2-lua/bin:/apps/module/4.3.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/user/wilkins/bin)
/var/spool/slurm/slurmd/job6678871/slurm_script: line 147: raxmlHPC-PTHREADS: command not found
which: no raxmlHPC-PTHREADS-AVX2 in (/apps/python3/3.7.0/bin/:/apps/python2/2.7.15/bin:/apps/lua/5.3.4/bin:/apps/java/11.0.4/bin:/apps/gcc/9.1.0/bin:/apps/R/3.6.1/bin/:/apps/perl/5.28.0/bin/:/apps/cube/default/bin:/apps/slurm/19.05.3-2-lua/bin:/apps/module/4.3.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/user/wilkins/bin)
which: no raxmlHPC-PTHREADS-AVX in (/apps/python3/3.7.0/bin/:/apps/python2/2.7.15/bin:/apps/lua/5.3.4/bin:/apps/java/11.0.4/bin:/apps/gcc/9.1.0/bin:/apps/R/3.6.1/bin/:/apps/perl/5.28.0/bin/:/apps/cube/default/bin:/apps/slurm/19.05.3-2-lua/bin:/apps/module/4.3.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/user/wilkins/bin)
which: no raxmlHPC-PTHREADS-SSE3 in (/apps/python3/3.7.0/bin/:/apps/python2/2.7.15/bin:/apps/lua/5.3.4/bin:/apps/java/11.0.4/bin:/apps/gcc/9.1.0/bin:/apps/R/3.6.1/bin/:/apps/perl/5.28.0/bin/:/apps/cube/default/bin:/apps/slurm/19.05.3-2-lua/bin:/apps/module/4.3.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/user/wilkins/bin)
/var/spool/slurm/slurmd/job6678875/slurm_script: line 147: raxmlHPC-PTHREADS: command not found

To see what was wrong, I also went into the current pantagruel working directory and ran the raxml command directly:

cd /scratch/clamchatka/Panta/test9/02.gene_alignments/full_protfam_alignments_species_code
raxmlHPC-PTHREADS -T 4 -s PANTAGC003882.codes.aln -n PANTAGC003882.codes -m GTRCAT -p 1753 -f c

I get the following error:

#####
raxmlHPC-PTHREADS -T 4 -s PANTAGC003882.codes.aln -n PANTAGC003882.codes -m GTRCAT -p 1753 -f c && grep 'exactly identical$' RAxML_info.PANTAGC003882.codes | sed -e 's/IMPORTANT WARNING: Sequences \(.\+\) and \(.\+\) are exactly identical/\1\t\2/g' > RAxML_identical_sequences.PANTAGC003882.codes
!!! ERROR : during former RAxML call ; exit now

#####
raxmlHPC-PTHREADS -T 4 -s PANTAGC003807.codes.aln -n PANTAGC003807.codes -m GTRCAT -p 1753 -f c && grep 'exactly identical$' RAxML_info.PANTAGC003807.codes | sed -e 's/IMPORTANT WARNING: Sequences \(.\+\) and \(.\+\) are exactly identical/\1\t\2/g' > RAxML_identical_sequences.PANTAGC003807.codes
!!! ERROR : during former RAxML call ; exit now
PANTAGC003892.codes
current directory is /tmp/slurm-6677634
copied input files /scratch/clamchatka/Panta/test9/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGC003892.codes.aln succesfully
ls ./
PANTAGC003892.codes.aln

#####
raxmlHPC-PTHREADS -T 4 -s PANTAGC003892.codes.aln -n PANTAGC003892.codes -m GTRCAT -p 1753 -f c && grep 'exactly identical$' RAxML_info.PANTAGC003892.codes | sed -e 's/IMPORTANT WARNING: Sequences \(.\+\) and \(.\+\) are exactly identical/\1\t\2/g' > RAxML_identical_sequences.PANTAGC003892.codes
!!! ERROR : during former RAxML call ; exit now

RAxML can't, parse the alignment file as phylip file 
it will now try to parse it as FASTA file

ERROR: Bad base (L) at site 4 of sequence 1

I will stop here. I think I figured out the Slurm side. However, now RAxML is bugging me. Florent, do you know what is going on here? Have you had this problem before?

What do you suggest I do now?