bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
990 stars 354 forks source link

RNA-seq variant calling - vcfanno and multithreading #2249

Closed rdocking closed 6 years ago

rdocking commented 6 years ago

Hi guys -

Following up on #2242 (where the issue was due to GATK 4.0.0.0 vs. 4.0.1.0) and #2189 (where I wasn't able to get the multithreaded version of HaplotypeCaller going, I'm now hitting the following issue.

GATK completes variant-calling, but I hit this error with vcfanno:

[0:apply]: CalledProcessError: Command 'set -o pipefail; /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/vcfanno -p 1 -lua /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.lua -base-path /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19 /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.conf /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/joint/gatk-haplotype-joint/NA12878_small_batch/NA12878_small_batch-joint-stdchrs.vcf.gz |  bgzip -c > /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/bcbiotx/tmpEjF3WD/NA12878_small_batch-joint-stdchrs-annotated-rnaedit.vcf.gz
=============================================
vcfanno version 0.2.8 [built with go1.8]
see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 1 sources from 1 files
panic: strconv.Atoi: parsing "position": invalid syntax
goroutine 13437 [running]:
github.com/brentp/irelate.newMerger(0x7cd780, 0x0, 0xc42093e080, 0x2, 0x2, 0xc42005ff58)
    /home/brentp/go/src/github.com/brentp/irelate/irelate.go:229 +0x2d7
github.com/brentp/irelate.IRelate(0x7cd778, 0x0, 0x7cd780, 0xc42093e080, 0x2, 0x2, 0xc421066090, 0x10000)
    /home/brentp/go/src/github.com/brentp/irelate/irelate.go:135 +0x69
github.com/brentp/irelate.PIRelate.func3.1(0xc420dd5ce0, 0x7cd748, 0xc420218c00, 0xc42093e080, 0x2, 0x2)
    /home/brentp/go/src/github.com/brentp/irelate/parallel.go:245 +0x7b
created by github.com/brentp/irelate.PIRelate.func3
    /home/brentp/go/src/github.com/brentp/irelate/parallel.go:269 +0x160
' returned non-zero exit status 2

Any thoughts? This is with bcbio updated this morning to the latest dev version. On a separate note, I'm still not seeing the Spark version of HaplotypeCaller running for the RNA-Seq pipeline -

I'm able to get 32 cores going for STAR:

[2018-02-01T16:59Z] ccgihpc.hpc.bcgsc.ca: Resource requests: picard, samtools, star; memory: 3.00, 3.00, 3.00; cores: 32, 32, 32
[2018-02-01T16:59Z] ccgihpc.hpc.bcgsc.ca: Configuring 2 jobs to run, using 32 cores each with 96.1g of memory reserved for each job
[2018-02-01T16:59Z] ccgihpc.hpc.bcgsc.ca: Running locally instead of distributed -- checkpoint passed: alignment

But only 1 for GATK:

[2018-02-01T16:59Z] ccgihpc.hpc.bcgsc.ca: Resource requests: gatk; memory: 3.60; cores: 1
[2018-02-01T16:59Z] ccgihpc.hpc.bcgsc.ca: Configuring 1 jobs to run, using 1 cores each with 3.60g of memory reserved for each job
[2018-02-01T17:01Z] ccgihpc.hpc.bcgsc.ca: Timing: RNA-seq variant calling

Any thoughts? Here's the relevant bits of my system YAML file:

resources:
  default:
    memory: 3.6G  # 3.6*32 ~= 115G
    cores: 32
    jvm_opts: ["-Xms800m", "-Xmx3600m"]
  gatk:
    jvm_opts: ["-Xms800m", "-Xmx3600m"]

Thanks again for all your help!

chapmanb commented 6 years ago

Rod; Thanks for the report and apologies about the issues. For the vcfanno error, we've been working on debugging that one here and I'm a bit stuck on what is going on because I can't reproduce. There are a couple of things to check in this message and having an additional data point would be a big help:

https://groups.google.com/d/msg/biovalidation/Zdpuza0vJdI/fegttwhOAQAJ

For parallelization, is this single machine or a multiple machine cluster run? Right now these steps parallelize by sample first which makes subsequent parallelization of the actual variant calling on a machine tricky. If multiple machine I'll dig more and may have missed something in the parallelization here. Thank you again for the help debugging.

rdocking commented 6 years ago

Ah - I've just been using a single sample on a local machine for debugging, so that may explain the parallelization thing. I'll try again later using multiple samples on our cluster to see what happens.

I'll take a look at the vcfanno thread and see if I can spot anything.

rdocking commented 6 years ago

Looking at the command that was run:

[0:apply]: CalledProcessError: Command 'set -o pipefail; /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/vcfanno -p 1 -lua /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.lua -base-path /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19 /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.conf /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/joint/gatk-haplotype-joint/NA12878_small_batch/NA12878_small_batch-joint-stdchrs.vcf.gz |  bgzip -c > /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/bcbiotx/tmpEjF3WD/NA12878_small_batch-joint-stdchrs-annotated-rnaedit.vcf.gz

I ran just the vcfanno part, without the bgzip part of the command:

/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/vcfanno -p 1 -lua /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.lua -base-path /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19 /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.conf /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/joint/gatk-haplotype-joint/NA12878_small_batch/NA12878_small_batch-joint-stdchrs.vcf.gz

Watching the output, the crash happened just at the boundary of chr22 and chrX:

chr22   51214277        .       A       G       18.59   .       AC=2;AF=1;AN=2;DP=2;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQ0=0;QD=18.59;SOR=1.609        GT:AD:DP:GQ:MMQ:PGT:PID:PL      1/1:0,1:1:3:60,0:1|1:51214277_A_G:45,3,0
chrX    224182  .       G       C       18.59   .       AC=2;AF=1;AN=2;DP=1;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQ0=0;QD=18.59;SOR=1.609        GT:AD:DP:GQ:MMQ:PGT:PID:PL      1/1:0,1:1:3:60,0:1|1:224180_C_CCGGGCACGAGG:45,3,0

The chr22 record passed, it looks like it crashed on the first chrX record.

rdocking commented 6 years ago

Maybe something weird in the sorting of the RADAR file?

[2018-02-02 08:52:29]{rdocking@clingen01}/projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/editing/> gunzip -c RADAR.bed.gz | awk {'print $1'} | uniq -c
 252016 chr1
 113928 chr10
 114020 chr11
 146931 chr12
  41917 chr13
  92119 chr14
  93728 chr15
 127960 chr16
 174174 chr17
  36978 chr18
 199959 chr19
 175788 chr2
  67863 chr20
  22973 chr21
  71641 chr22
 142732 chr3
  90215 chr4
 106033 chr5
 111432 chr6
 145354 chr7
  84777 chr8
  99381 chr9
     72 chrM
      1 #chromosome
  60468 chrX
   4000 chrY

Here's the relevant part of the file:

chrM    15806   15807   uc004coy.2      +       3UTR    3UTR    no      no      N       N       N
chrM    15809   15810   uc004coy.2      +       3UTR    3UTR    no      no      N       N       N
chrM    15815   15816   uc004coy.2      +       3UTR    3UTR    no      no      N       N       N
#chromosome     position        1       gene    strand  annot1  annot2  alu?    non_alu_repetitive?     conservation_chimp      conservation_rhesus     conservation_mouse
chrX    218386  218387  uc004cpc.2      +       3UTR    3UTR    yes     no      N       N       N
chrX    218422  218423  uc004cpc.2      +       3UTR    3UTR    yes     no      N       N       N
chrX    218458  218459  uc004cpc.2      +       3UTR    3UTR    yes     no      N       N       N
rdocking commented 6 years ago

Confirmed that this fixes the issue:

data/RADAR.sorted.bed.gz: data/RADAR.bed.gz
    bedtools sort \
    -faidx /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/seq/hg19.fa.fai \
    -i data/RADAR.bed.gz \
     | bgzip -c > $@

I ran this new RADAR bed file through the vcfanno command and it worked, annotating potential RNA edit sites as expected.

chapmanb commented 6 years ago

Rod -- thanks so much for the detective work and identifying the underlying issue. I'm so confused as to why I didn't see this issue with our local install, but happy to have a fix that can get things working for everyone. I updated the prepped RADAR bed files so they're now correct and if you do:

bcbio_nextgen.py upgrade --data

it should grab the fixed versions and hopefully work cleanly from the base install. Please let us know if you run into any other issues at all and thanks again.

rdocking commented 6 years ago

Thanks again! Just confirming that the pipeline ran to completion after updating the RADAR BED file.