Closed vmkalbskopf closed 3 years ago
It sounds like you are using an old version, could you re-clone and use the latest? Please let me know if the error persists. I simplified it at some point, so now is basically ploidy-agnostic, so you can even mix samples with different ploidies within the same VCF
Edgardo
From you VCF snippet, can I assume that there is a return line after R52
?
If so, I can see your VCF has something I have not seen yet : <NON_REF>
in the genotypes, that would definitely break the script. Do you know what software introduced the <NON_REF>
genotype?
Take a look at this: https://gatk.broadinstitute.org/hc/en-us/community/posts/360062926652-Output-vcf-file-from-HaplotypeCaller-mostlly-contain-NON-REF-alteration
Perhaps you could use bcftools
to merge your VCFs
Yes, there is a return after the R52
.
NON_REF
was introduced by HaplotypeCaller, the very first step. Since it wasn't caused by merging, do I need to merge with bcftools instead?
Here's a line from that first file:
##source=HaplotypeCaller
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R47 PRELSG_01_v1 1 . C <NON_REF> . . END=341 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
I'm using version 2.3. The latest release version. Do I still need to try again with the latest devel version?
According to the link, it uses non-ref because it is a GVCF file. So I guess you don't support that? Btw, thanks for the very quick response!
Re-clone the repository or download release 2.5, but I think your problem is that you are provinding a GVCF (an format of Broad Institute) in the link I sent you they show how to normalize it to a regular VCF.
Ok, I'll do that. Thanks
Perhaps a good thing to check for now :-)
I could also add support if ignoring <NON_REF>
is the only major difference. Would you mind emailing me a few hundred lines of your VCF?
Sure!
HI there!
My multi sample vcf file with non-header 551 lines (only 1 contig) takes hours to be processed. In fact, it has never finished. This is RNA-seq data from multiple haploid parasites in the host, so ploidy does not matter. GATK haplotypecaller was used to produce GVCF files, then they were filtered with GATK VariantFiltration, and then indels were removed with bcftools. Finally the individual sample vcf files were merged with GATK CombineVariants.
I made a version of the vcf that only has the header and the first snp line, and that still takes a few minutes and has not finished as of this writing.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R47 R48 R49 R50 R51 R52 PRELSG_01_v1 470 . C A,<NON_REF> . . BaseQRankSum=-2.287;DP=10;ExcessHet=3.01;MQRankSum=0;RAW_MQandDP=36000,10;ReadPosRankSum=0.728GT:AD:DP:GQ:PL:SB ./.:.:.:.:.:. ./.:8,2,0:10:10:0,10,256,24,261,275:7,1,2,0 ./.:.:.:.:.:. ./.:.:.:.:.:. ./.:.:.:.:.:. ./.:.:.:.:.:.
That is the first and only snp line in the vcf file that the script loops around. I uncommented the print statements for 'missing', 'broken[j], etc, bu they never print.