edgardomortiz / vcf2phylip

Convert SNPs in VCF format to PHYLIP, NEXUS, binary NEXUS, or FASTA alignments for phylogenetic analysis
GNU General Public License v3.0
294 stars 85 forks source link

Hours taken to find the ploidy #31

Closed vmkalbskopf closed 3 years ago

vmkalbskopf commented 3 years ago

HI there!

My multi sample vcf file with non-header 551 lines (only 1 contig) takes hours to be processed. In fact, it has never finished. This is RNA-seq data from multiple haploid parasites in the host, so ploidy does not matter. GATK haplotypecaller was used to produce GVCF files, then they were filtered with GATK VariantFiltration, and then indels were removed with bcftools. Finally the individual sample vcf files were merged with GATK CombineVariants.

I made a version of the vcf that only has the header and the first snp line, and that still takes a few minutes and has not finished as of this writing. #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R47 R48 R49 R50 R51 R52 PRELSG_01_v1 470 . C A,<NON_REF> . . BaseQRankSum=-2.287;DP=10;ExcessHet=3.01;MQRankSum=0;RAW_MQandDP=36000,10;ReadPosRankSum=0.728GT:AD:DP:GQ:PL:SB ./.:.:.:.:.:. ./.:8,2,0:10:10:0,10,256,24,261,275:7,1,2,0 ./.:.:.:.:.:. ./.:.:.:.:.:. ./.:.:.:.:.:. ./.:.:.:.:.:.

That is the first and only snp line in the vcf file that the script loops around. I uncommented the print statements for 'missing', 'broken[j], etc, bu they never print.

edgardomortiz commented 3 years ago

It sounds like you are using an old version, could you re-clone and use the latest? Please let me know if the error persists. I simplified it at some point, so now is basically ploidy-agnostic, so you can even mix samples with different ploidies within the same VCF

Edgardo

edgardomortiz commented 3 years ago

From you VCF snippet, can I assume that there is a return line after R52? If so, I can see your VCF has something I have not seen yet : <NON_REF> in the genotypes, that would definitely break the script. Do you know what software introduced the <NON_REF> genotype?

edgardomortiz commented 3 years ago

Take a look at this: https://gatk.broadinstitute.org/hc/en-us/community/posts/360062926652-Output-vcf-file-from-HaplotypeCaller-mostlly-contain-NON-REF-alteration

Perhaps you could use bcftools to merge your VCFs

vmkalbskopf commented 3 years ago

Yes, there is a return after the R52.

NON_REFwas introduced by HaplotypeCaller, the very first step. Since it wasn't caused by merging, do I need to merge with bcftools instead?

Here's a line from that first file: ##source=HaplotypeCaller #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R47 PRELSG_01_v1 1 . C <NON_REF> . . END=341 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 I'm using version 2.3. The latest release version. Do I still need to try again with the latest devel version?

vmkalbskopf commented 3 years ago

According to the link, it uses non-ref because it is a GVCF file. So I guess you don't support that? Btw, thanks for the very quick response!

edgardomortiz commented 3 years ago

Re-clone the repository or download release 2.5, but I think your problem is that you are provinding a GVCF (an format of Broad Institute) in the link I sent you they show how to normalize it to a regular VCF.

vmkalbskopf commented 3 years ago

Ok, I'll do that. Thanks

Perhaps a good thing to check for now :-)

edgardomortiz commented 3 years ago

I could also add support if ignoring <NON_REF> is the only major difference. Would you mind emailing me a few hundred lines of your VCF?

vmkalbskopf commented 3 years ago

Sure!