bcgsc / mavis

Merging, Annotation, Validation, and Illustration of Structural variants
http://mavis.bcgsc.ca
GNU General Public License v3.0

Trouble running full Mavis tutorial on SLURM #212

Closed: moldach closed this issue 4 years ago

moldach commented 4 years ago

MAVIS version: 2.2.6

Python version: 3.8.2

OS: CentOS Linux 7

I'm running into issues running MAVIS under the SLURM scheduler.

Specifically, I'm getting errors when working through the MAVIS full tutorial, which is supposed to submit its jobs to SLURM.

Getting tutorial data

wget http://www.bcgsc.ca/downloads/mavis/tutorial_data.tar.gz
tar -xvzf tutorial_data.tar.gz
# Downloading reference inputs
wget https://raw.githubusercontent.com/bcgsc/mavis/master/tools/get_hg19_reference_files.sh
bash get_hg19_reference_files.sh

Generating the Config file

source reference_inputs/hg19_env.sh
salloc --time=3:0:0 --mem=6000
mavis config \
    --library L1522785992-normal genome normal False tutorial_data/L1522785992_normal.sorted.bam \
    --library L1522785992-tumour genome diseased False tutorial_data/L1522785992_tumour.sorted.bam \
    --library L1522785992-trans transcriptome diseased True tutorial_data/L1522785992_trans.sorted.bam \
    --convert breakdancer tutorial_data/breakdancer-1.4.5/*txt breakdancer \
    --convert breakseq tutorial_data/breakseq-2.2/breakseq.vcf.gz breakseq \
    --convert chimerascan tutorial_data/chimerascan-0.4.5/chimeras.bedpe chimerascan \
    --convert defuse tutorial_data/defuse-0.6.2/results.classify.tsv defuse \
    --convert manta tutorial_data/manta-1.0.0/diploidSV.vcf.gz tutorial_data/manta-1.0.0/somaticSV.vcf manta \
    --assign L1522785992-trans chimerascan defuse \
    --assign L1522785992-tumour breakdancer breakseq manta  \
    --assign L1522785992-normal breakdancer breakseq manta \
    -w mavis.cfg
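
A quick plain-text sanity check that the config was written and references the libraries declared above (nothing MAVIS-specific assumed here):

# Count mentions of the sample libraries and skim the generated config.
grep -c 'L1522785992' mavis.cfg
less mavis.cfg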

Setting up the pipeline

mavis setup mavis.cfg -o output_dir/
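
If setup succeeds, output_dir/ should contain per-library pipeline directories; the directory name below is taken from the log paths that appear later in this thread:

# Confirm the per-library directories were created by setup.
ls output_dir/
ls output_dir/L1522785992-normal_normal_genome/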

Submitting Jobs to the Cluster

mavis schedule -o output_dir/ --submit

Looking at squeue immediately after submitting, I see:

(mavis) [moldach@cdr527 mavis-test]$ squeue -u moldach
          JOBID     USER      ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
       44188077  moldach def-mtarailo             sh   R    7:56:56     1    1        N/A    256M cdr527 (None)
     44188145_1  moldach def-mtarailo MV_L1522785992   R   15:59:56     1    1        N/A  18000M cdr811 (None)
44188141_[1-108  moldach def-mtarailo MV_L1522785992  PD   16:00:00     1    1        N/A  16000M  (Priority)
44188143_[1-108  moldach def-mtarailo MV_L1522785992  PD   16:00:00     1    1        N/A  16000M  (Priority)
44188147_[1-108  moldach def-mtarailo MA_L1522785992  PD   16:00:00     1    1        N/A  12000M  (Dependency)
     44188162_1  moldach def-mtarailo MA_L1522785992  PD   16:00:00     1    1        N/A  12000M  (Dependency)
44188159_[1-108  moldach def-mtarailo MA_L1522785992  PD   16:00:00     1    1        N/A  12000M  (Dependency)
       44188179  moldach def-mtarailo MP_batch-mFDMV  PD   16:00:00     1    1        N/A  16000M  (Dependency)
       44188183  moldach def-mtarailo MS_batch-mFDMV  PD   16:00:00     1    1        N/A  16000M  (Dependency)

However, soon after, I see a non-zero exit code:

(mavis) [moldach@cdr527 biostars439754]$ squeue -u moldach
          JOBID     USER      ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
       44188077  moldach def-mtarailo             sh   R    7:52:25     1    1        N/A    256M cdr527 (None)
44188143_[1-108  moldach def-mtarailo MV_L1522785992  PD   16:00:00     1    1        N/A  16000M  (Priority)
44188159_[1-108  moldach def-mtarailo MA_L1522785992  PD   16:00:00     1    1        N/A  12000M  (Dependency)
44188141_[21-10  moldach def-mtarailo MV_L1522785992  PD   16:00:00     1    1        N/A  16000M  (Priority)
44188147_[7,21-  moldach def-mtarailo MA_L1522785992  PD   16:00:00     1    1        N/A  12000M  (Dependency)
     44188141_7  moldach def-mtarailo MV_L1522785992  CG   15:58:58     1    1        N/A  16000M cdr787 (NonZeroExitCode)

Running the following to check the jobs:

(mavis) [moldach@cdr527 mavis-test]$ mavis schedule -o output_dir
                      MAVIS: 2.2.6
                      hostname: cdr527.int.cedar.computecanada.ca
[2020-06-17 09:36:14] arguments
                        command = 'schedule'
                        log = None
                        log_level = 'INFO'
                        output = 'output_dir'
                        resubmit = False
                        submit = False
[2020-06-17 09:36:15] validate
                        MV_L1522785992-normal_batch-mFDMVoNMPeFmRcZqbfVSqx (44188141) is FAILED
                          108 tasks are FAILED
                        MV_L1522785992-tumour_batch-mFDMVoNMPeFmRcZqbfVSqx (44188143) is FAILED
                          108 tasks are FAILED
                        MV_L1522785992-trans_batch-mFDMVoNMPeFmRcZqbfVSqx (44188145) is FAILED
                          1 task is FAILED
[2020-06-17 09:36:15] annotate
                        MA_L1522785992-normal_batch-mFDMVoNMPeFmRcZqbfVSqx (44188147) is CANCELLED
                          108 tasks are CANCELLED
                        MA_L1522785992-tumour_batch-mFDMVoNMPeFmRcZqbfVSqx (44188159) is CANCELLED
                          108 tasks are CANCELLED
                        MA_L1522785992-trans_batch-mFDMVoNMPeFmRcZqbfVSqx (44188162) is CANCELLED
                          1 task is CANCELLED
[2020-06-17 09:36:15] pairing
                        MP_batch-mFDMVoNMPeFmRcZqbfVSqx (44188179) is CANCELLED
                          missing log file: /scratch/moldach/mavis-test/output_dir/pairing/job-MP_batch-mFDMVoNMPeFmRcZqbfVSqx-44188179.log
[2020-06-17 09:36:15] summary
                        MS_batch-mFDMVoNMPeFmRcZqbfVSqx (44188183) is CANCELLED
                          missing log file: /scratch/moldach/mavis-test/output_dir/summary/job-MS_batch-mFDMVoNMPeFmRcZqbfVSqx-44188183.log
                      rewriting: output_dir/build.cfg

This shows that the jobs failed, but I'm not sure why. I was able to run the MAVIS mini-tutorial locally using salloc on SLURM, so I don't think it's a dependency issue.

I would appreciate your help trying to solve this issue.

oneillkza commented 4 years ago

I'd start out looking at the log files in output_dir/L1522785992-normal_normal_genome/validate/batch-mFDMVoNMPeFmRcZqbfVSqx-*

It sounds like all the jobs failed, so you can pick any of those directories.
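
Something like the loop below should surface the actual error quickly. This is just a sketch: I'm assuming the job logs sit inside those batch directories as *.log files, so adjust the glob if they're named differently:

# Print the tail of every validate job log; tracebacks usually sit at the end.
for log in output_dir/L1522785992-normal_normal_genome/validate/batch-mFDMVoNMPeFmRcZqbfVSqx-*/*.log; do
    echo "== $log =="
    tail -n 20 "$log"
done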

moldach commented 4 years ago

Hi @oneillkza, thanks.

Looks like the error is due to a missing blat dependency (so I was wrong about it not being a dependency issue):

Traceback (most recent call last):
  File "/home/moldach/bin/mavis/bin/mavis", line 10, in <module>
    sys.exit(main())
  File "/home/moldach/bin/mavis/lib/python3.8/site-packages/mavis/main.py", line 297, in main
    args.aligner_version = get_aligner_version(args.aligner)
  File "/home/moldach/bin/mavis/lib/python3.8/site-packages/mavis/align.py", line 152, in get_aligner_v$
    raise ValueError("unable to parse blat version number from:'{}'".format(proc))
ValueError: unable to parse blat version number from:'/bin/sh: blat: command not found'

Now, blat is available to users on the HPC via module load blat. If I run which blat, I can see the path is: /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/intel2016.4/blat/3.5/bin/blat

How can I tell MAVIS where to look for blat?

oneillkza commented 4 years ago

Hmm -- as far as I can tell, MAVIS just expects blat to be on the PATH. Are you running module load blat at the start of your script? I believe that just sets up the relevant path/environment variables, so as long as those propagate to the jobs, it should work fine.
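
Since sbatch propagates the submitting shell's environment by default (--export=ALL), loading the module before resubmitting should be enough. A minimal sketch, assuming the --resubmit flag corresponds to the resubmit = False option shown in your schedule log above:

# Put blat on the PATH, then resubmit the failed jobs; sbatch exports the
# caller's environment by default, so PATH should reach the compute nodes.
module load blat
which blat    # should print the /cvmfs/... path you noted
mavis schedule -o output_dir/ --resubmit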

moldach commented 4 years ago

Hi again. I've confirmed that running module load blat on the remote server makes blat accessible on the PATH. Including it at the start of the script solved my problem.

Thank you very much!