Closed sxv closed 5 years ago
Sujay; Sorry about the issue here I'm not entirely sure what is going wrong. A couple of questions:
hg19
build you're using the normal one distributed with bcbio?I'm wondering if there is some kind of out of order contig thing happening in the VCF header which is confusing GATK here.
As another check, what is the last record of the chrM split VCF file:
zcat /15/mutect2/prostate_lcm_15_19-chrM_0_16571-fixheader.vcf.gz | grep -v ^# | tail -1
Sorry to not know exactly what is happening, but hope this helps some with debugging.
Got a similar problem here (with hg19, distributed with bcbio, aligned against the same hg19). This however occurs with RNA-seq variant calling (same issue on chromosome M):
The output on the problematic file is as follows.
chrM 15957 . T <NON_REF> . . END=16023 GT:DP:GQ:MIN_DP:PL 0:478:99:92:0,1800
It looks like the contigs are indeed in the wrong order.
"Right" GATK file from another project:
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
From this one:.
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr11_gl000202_random,length=40103>
Luca; Thanks for the report and sorry this is causing issues. Are you able to track down at all what is happening? We haven't seen this issue and I'm not sure how best to reproduce. It looks like your problem file is not sorted in GATK order but in numerical order, which typically indicates a differently sorted reference file somewhere. If you're able to provide more details or ideas about what might be happening happy to dig more.
At first I thought we were using pre-aligned files, but after a look at the configuration, it started from FASTQ files, so they were all produced during the run, using the same assembly:
grep build RNA-Seq-complete.yaml | sort -u
genome_build: hg19
For now I fixed the VCFs manually with UpdateVCFSequenceDictionary
from the GATK and then replaced them in the work
directory by bcbio.
Hi, I'm stuck on the same error:
ERROR 2019-05-01 12:15:33 GatherVcfs There was a problem with gathering the INPUT.java.lang.IllegalArgumentException: First record in file /home/bioinf/bcbio/working/freebayes/chr2/Batch1-chr2_0_13657608.vcf.gz is not after first record in previous file /home/bioinf/bcbio/working/freebayes/chr1_GL456221_random/Batch1-chr1_GL456221_random_0_205860.vcf.gz
I made gists of the vcf files (with only the first record): https://gist.github.com/davidvanzessen/644924290d9afd89b19bd5b3ffad0434
My bcbio config yml:
details:
- algorithm:
aligner: bwa
mark_duplicates: true
remove_lcr: true
variantcaller: [freebayes,vardict,varscan]
ensemble:
numpass: 2
align_split_size: false
analysis: variant2
lane: 1
description: Galaxy_1
files: [/home/bioinf/bcbio/input/normal_0_R1.fastq.gz, /home/bioinf/bcbio/input/normal_0_R2.fastq.gz]
genome_build: mm10
metadata:
phenotype: normal
batch: Batch1
- algorithm:
aligner: bwa
mark_duplicates: true
remove_lcr: true
variantcaller: [freebayes,vardict,varscan]
ensemble:
numpass: 2
align_split_size: false
analysis: variant2
lane: 2
description: Galaxy_2
files: [/home/bioinf/bcbio/input/tumor_0_R1.fastq.gz, /home/bioinf/bcbio/input/tumor_0_R2.fastq.gz]
genome_build: mm10
metadata:
phenotype: tumor
batch: Batch1
Hope someone can help, let me know if there is any more information I can provide.
Thanks for the detailed report and apologies about the issue. I'm not exactly sure how these are getting out of order, but it looks from the error message like the merge has sorted one of the extra chromosomes (chr1_GL456221_random
) before a standard chromosome (chr2
) causing the complaint. We pushed some debugging output that might help us isolate the issue if you're able to update to the latest development (bcbio_nextgen.py upgrade -u development
), re-run in place and report the full error message with debug output that comes back as a gist. Additionally, if you look at your mm10 reference fasta index (/home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq/mm10.fa.fai
), could you confirm the ordering of contigs in there matches what's in the VCF inputs?
Sorry again about the issues and thanks for the help debugging this.
The ordering of the contigs in mm10.fa.fai
is different:
https://gist.github.com/davidvanzessen/ba021e7b69b7c346d0d1e13cbd3750ca
I've upgraded to the development branch and I'm rerunning the analysis, I'll post back here when it's done.
Thank you for taking the time to help solve this :+1:
Thank you for helping debug this. The fai file is the issue, as the contigs in there are out of order with respect to what is in the bwa indices and subsequent VCF files. This mismatch ends up with us trying to order the VCFs incorrectly and the error.
If this mm10 file got created with bcbio, I think the issue is that it's not using the pre-created mm10 downloads and instead trying to create and sort them as part of the install. I pushed a fix to download and use the pre-existing ones. If you update with:
mv /home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq /home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq.problem
bcbio_nextgen.py upgrade --data
I hope it'll pull the correct ones, then you can re-run your analysis in place without needing to change anything and it should hopefully do the right thing with sorting. Please let us know if you still hit any issues.
Renaming the seq
directory and running bcbio_nextgen.py upgrade --data
gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq/mm10-resources.yaml'
I tried just creating the seq
directory with just mm10-resources.yaml
, but it then instantly jumps to Upgrade completed successfully.
without downloading anything new.
Thank you for testing this and apologies about the upgrade error. I made a mistake and forgot to wire in the new recipe for the seq
directory so it didn't actually get run, and then it errors later when expecting to find that directory. Could you try the upgrade one more time with a fresh cloudbiolinux download:
rm -rf tmpbcbio-install
bcbio_nextgen.py upgrade --data
Hope this works cleanly for you and thank you again for the help debugging.
It worked, thank you so much for the help :+1:
I just met the same error with this command:
vcf_file_string="" for vcf in $(ls -1 newcontiglines.vcfchr*.vcf) ; do vcf_file_string="$vcf_file_string -I $vcf" done
module load gatk/4.3.0.0 gatk GatherVcfs \ -R genome.fa $vcf_file_string -O dbSNP.vcf
then I ls the vcf files by numeric order, it works fine.
vcf_file_string="" for vcf in $(ls -1v newcontiglines.vcfchr*.vcf) ; do vcf_file_string="$vcf_file_string -I $vcf" done
module load gatk/4.3.0.0 gatk GatherVcfs \ -R genome.fa $vcf_file_string -O dbSNP.vcf
Hi,
I am excited to be using the new gatk4 workflow on a fresh bcbio development install. In this case I'm running a tumor/normal pair of targeted DNA-Seq from pre-aligned bam files. I've read mixed opinions about whether or not recalibration is recommended for gatk4, so advice on that would be helpful as well. Anyway, I'm getting the following error message:
Here are the first lines of the vcfs:
$ zcat /15/mutect2/chr1/prostate_lcm_15_19-chr1_109811749_142535913.vcf.gz | grep -v ^# | head -1
$ zcat /15/mutect2/prostate_lcm_15_19-chrM_0_16571-fixheader.vcf.gz | grep -v ^# | head -1
Is the error because chr1 shouldn't follow chrM, or something else going on?
Here's my config:
Thanks!