Closed by rdocking 6 years ago
Rod, thanks for the report, and apologies for the issues. We've been debugging the vcfanno error here, and I'm a bit stuck on what is going on because I can't reproduce it. There are a couple of things to check in this message, and having an additional data point would be a big help:
https://groups.google.com/d/msg/biovalidation/Zdpuza0vJdI/fegttwhOAQAJ
For parallelization, is this a single-machine run or a multiple-machine cluster run? Right now these steps parallelize by sample first, which makes subsequent parallelization of the actual variant calling on a machine tricky. If it's multiple machines, I'll dig more, as I may have missed something in the parallelization here. Thank you again for the help debugging.
Ah - I've just been using a single sample on a local machine for debugging, so that may explain the parallelization thing. I'll try again later using multiple samples on our cluster to see what happens.
I'll take a look at the vcfanno thread and see if I can spot anything.
Looking at the command that was run:
[0:apply]: CalledProcessError: Command 'set -o pipefail; /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/vcfanno -p 1 -lua /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.lua -base-path /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19 /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.conf /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/joint/gatk-haplotype-joint/NA12878_small_batch/NA12878_small_batch-joint-stdchrs.vcf.gz | bgzip -c > /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/bcbiotx/tmpEjF3WD/NA12878_small_batch-joint-stdchrs-annotated-rnaedit.vcf.gz
I ran just the vcfanno part, without the bgzip part of the command:
/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/vcfanno -p 1 -lua /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.lua -base-path /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19 /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/config/vcfanno/rnaedit.conf /projects/karsanscratch/rdocking/KARSANBIO-1290_holiday_setup/A34430_small/work/joint/gatk-haplotype-joint/NA12878_small_batch/NA12878_small_batch-joint-stdchrs.vcf.gz
Watching the output, the crash happened right at the boundary between chr22 and chrX:
chr22 51214277 . A G 18.59 . AC=2;AF=1;AN=2;DP=2;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQ0=0;QD=18.59;SOR=1.609 GT:AD:DP:GQ:MMQ:PGT:PID:PL 1/1:0,1:1:3:60,0:1|1:51214277_A_G:45,3,0
chrX 224182 . G C 18.59 . AC=2;AF=1;AN=2;DP=1;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQ0=0;QD=18.59;SOR=1.609 GT:AD:DP:GQ:MMQ:PGT:PID:PL 1/1:0,1:1:3:60,0:1|1:224180_C_CCGGGCACGAGG:45,3,0
The chr22 record passed, so it looks like it crashed on the first chrX record.
Maybe something weird in the sorting of the RADAR file?
[2018-02-02 08:52:29]{rdocking@clingen01}/projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/editing/> gunzip -c RADAR.bed.gz | awk {'print $1'} | uniq -c
252016 chr1
113928 chr10
114020 chr11
146931 chr12
41917 chr13
92119 chr14
93728 chr15
127960 chr16
174174 chr17
36978 chr18
199959 chr19
175788 chr2
67863 chr20
22973 chr21
71641 chr22
142732 chr3
90215 chr4
106033 chr5
111432 chr6
145354 chr7
84777 chr8
99381 chr9
72 chrM
1 #chromosome
60468 chrX
4000 chrY
Here's the relevant part of the file:
chrM 15806 15807 uc004coy.2 + 3UTR 3UTR no no N N N
chrM 15809 15810 uc004coy.2 + 3UTR 3UTR no no N N N
chrM 15815 15816 uc004coy.2 + 3UTR 3UTR no no N N N
#chromosome position 1 gene strand annot1 annot2 alu? non_alu_repetitive? conservation_chimp conservation_rhesus conservation_mouse
chrX 218386 218387 uc004cpc.2 + 3UTR 3UTR yes no N N N
chrX 218422 218423 uc004cpc.2 + 3UTR 3UTR yes no N N N
chrX 218458 218459 uc004cpc.2 + 3UTR 3UTR yes no N N N
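As a rough illustration of the check above, here is a minimal Python sketch that scans a BED file for `#` header lines appearing after the first data record and lists chromosome runs in file order, much like the `uniq -c` output. The function names `scan_bed` and `scan_bed_gz` are hypothetical, and the sketch makes no claim about which of the two oddities (the stray header or the chromosome order) vcfanno actually trips on.

```python
import gzip


def scan_bed(lines):
    """Scan BED text lines.

    Returns (stray_headers, runs):
      stray_headers: 1-based line numbers of '#' lines found after the first data record
      runs: chromosome names in file order, one entry per contiguous run
    """
    stray_headers = []
    runs = []
    seen_data = False
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        if line.startswith("#"):
            if seen_data:
                stray_headers.append(lineno)
            continue
        seen_data = True
        chrom = line.split(None, 1)[0]
        if not runs or runs[-1] != chrom:
            runs.append(chrom)
    return stray_headers, runs


def scan_bed_gz(path):
    """Same check for a gzip- or bgzip-compressed BED file (hypothetical helper)."""
    with gzip.open(path, "rt") as handle:
        return scan_bed(handle)


if __name__ == "__main__":
    # Toy records modelled on the excerpt above; columns trimmed for brevity.
    sample = [
        "chrM\t15815\t15816\n",
        "#chromosome\tposition\t1\n",
        "chrX\t218386\t218387\n",
    ]
    stray, runs = scan_bed(sample)
    print("stray header lines:", stray)
    print("chromosome runs:", runs)
```

A chromosome that shows up more than once in `runs` would mean its records are not contiguous, which is another thing worth ruling out.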
Confirmed that this fixes the issue:
data/RADAR.sorted.bed.gz: data/RADAR.bed.gz
	bedtools sort \
		-faidx /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/seq/hg19.fa.fai \
		-i data/RADAR.bed.gz \
		| bgzip -c > $@
I ran this new RADAR bed file through the vcfanno command and it worked, annotating potential RNA edit sites as expected.
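To make the idea behind the re-sort concrete, here is a small Python sketch of ordering BED records by the contig order in a `.fai` index and then by start position. It is an illustration under stated assumptions, not a reimplementation of `bedtools sort`: the names `load_fai_order` and `sort_bed_by_fai` are made up for this example, dropping `#` lines is a choice of the sketch, and a contig missing from the index simply raises `KeyError`.

```python
def load_fai_order(fai_lines):
    """Map contig name -> rank, using the first column of a .fai index."""
    names = [line.split()[0] for line in fai_lines if line.strip()]
    return {name: rank for rank, name in enumerate(names)}


def sort_bed_by_fai(bed_lines, fai_lines):
    """Return BED lines ordered by .fai contig rank, then by start coordinate."""
    rank = load_fai_order(fai_lines)
    records = []
    for line in bed_lines:
        # Assumption of this sketch: blank and '#' lines are dropped.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start = line.split()[:2]
        records.append((rank[chrom], int(start), line))
    # Python's sort is stable, so ties keep their original relative order.
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return [rec[2] for rec in records]


if __name__ == "__main__":
    # Toy index and records; the lengths are illustrative, not real hg19 values.
    fai = ["chr1\t100\n", "chr2\t100\n", "chr10\t100\n", "chrX\t100\n"]
    bed = [
        "chr10\t5\t6\n",
        "#chromosome\tposition\n",
        "chrX\t1\t2\n",
        "chr2\t9\t10\n",
        "chr2\t3\t4\n",
    ]
    for line in sort_bed_by_fai(bed, fai):
        print(line, end="")
```

With the toy index, chr2 sorts ahead of chr10, which is the opposite of the lexicographic order seen in the `uniq -c` listing.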
Rod -- thanks so much for the detective work and identifying the underlying issue. I'm so confused as to why I didn't see this issue with our local install, but happy to have a fix that can get things working for everyone. I updated the prepped RADAR bed files so they're now correct and if you do:
bcbio_nextgen.py upgrade --data
it should grab the fixed versions and hopefully work cleanly from the base install. Please let us know if you run into any other issues at all and thanks again.
Thanks again! Just confirming that the pipeline ran to completion after updating the RADAR BED file.
Hi guys -
Following up on #2242 (where the issue was due to GATK 4.0.0.0 vs. 4.0.1.0) and #2189 (where I wasn't able to get the multithreaded version of HaplotypeCaller going), I'm now hitting the following issue.
GATK completes variant-calling, but I hit this error with vcfanno:
Any thoughts? This is with bcbio updated this morning to the latest dev version. On a separate note, I'm still not seeing the Spark version of HaplotypeCaller running for the RNA-Seq pipeline.
I'm able to get 32 cores going for STAR:
But only 1 for GATK:
Any thoughts? Here's the relevant bits of my system YAML file:
Thanks again for all your help!