edgardomortiz / vcf2phylip

Convert SNPs in VCF format to PHYLIP, NEXUS, binary NEXUS, or FASTA alignments for phylogenetic analysis
GNU General Public License v3.0

All of the SNPs were removed by vcf2phylip #44

Closed bjarnebartlett closed 9 months ago

bjarnebartlett commented 1 year ago

Aloha,

For some reason, all of my SNPs were removed when I converted my VCF file. I ran this on macOS, and the command I used was "python vcf2phylip.py -i all.vcf".

The result is:

Number of samples in VCF: 74
Total of genotypes processed: 11576930
Genotypes excluded because they exceeded the amount of missing data allowed: 11576930
Genotypes that passed missing data filter but were excluded for being MNPs: 0
SNPs that passed the filters: 0

I used GATK to create this merged VCF file, and I suspect there's a formatting issue. To check this, I have included the first 1000 rows of the merged VCF as an attachment.

How can I fix this? Mahalo in advance for any help you can give.

vcf1000.txt

-Bjarne

edgardomortiz commented 1 year ago

Hi Bjarne,

There is no formatting mistake. All of your genotypes are empty: if you open the VCF, you will only see ./. where the genotypes should be. This must be a processing mistake in the GATK step. For example, see this: https://gatk.broadinstitute.org/hc/en-us/community/posts/360060957571-Empty-vcf-after-GenotypeVCFs-when-combining-already-genotyped-samples
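A quick way to confirm this before running the conversion is to count how many genotype calls are missing. The following is only a minimal sketch, not part of vcf2phylip: the function name `count_missing_genotypes` and the two example records are made up for illustration, and it assumes a plain-text, tab-delimited VCF whose FORMAT column contains `GT`.

```python
def count_missing_genotypes(vcf_lines):
    """Return (total, missing) genotype counts over the data lines of a VCF.

    Hypothetical helper for illustration; not part of vcf2phylip.
    """
    total = missing = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            total += 1
            # A genotype counts as missing when every allele is "."
            alleles = gt.replace("|", "/").split("/")
            if all(allele == "." for allele in alleles):
                missing += 1
    return total, missing


# Made-up example records: three of the four genotypes are ./.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n",
    "chr1\t10\t.\tA\tG\t50\t.\t.\tGT:DP\t./.:0\t./.:0\n",
    "chr1\t20\t.\tC\tT\t50\t.\t.\tGT:DP\t0/1:12\t./.:0\n",
]
total, missing = count_missing_genotypes(example)
print(f"{missing} of {total} genotypes are missing")
```

To run it on a real file, pass an open file handle instead of the list, e.g. `with open("all.vcf") as fh: print(count_missing_genotypes(fh))`. If the two numbers are equal, every genotype is empty and the problem lies upstream of the conversion.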

I hope it helps

Edgardo

bjarnebartlett commented 11 months ago

Hello,

Thank you very much for your help. I am revisiting this project, and following your suggestion to look at the VCF files, I have generated a merged VCF that isn't empty. I generated the VCFs using GATK and merged them using BCFtools -- I attached part of the result for you to verify. I am now getting the error below. I read through the repository and couldn't figure out what KeyError: 'K' might be.

Cheers!

Bjarne

`Converting file 'allbcf.vcf':

Number of samples in VCF: 395
Traceback (most recent call last):
  File "/mnt/md0/projects/Brettanomyces/Brett_Analysis_All_2023BB/TreeBuild_11_28_23_BB/vcf2phylip.py", line 502, in <module>
    main()
  File "/mnt/md0/projects/Brettanomyces/Brett_Analysis_All_2023BB/TreeBuild_11_28_23_BB/vcf2phylip.py", line 314, in main
    site_tmp = get_matrix_column(record, num_samples,
  File "/mnt/md0/projects/Brettanomyces/Brett_Analysis_All_2023BB/TreeBuild_11_28_23_BB/vcf2phylip.py", line 129, in get_matrix_column
    column += AMBIG[geno_nuc]
KeyError: 'K'
`

partvcf.txt

edgardomortiz commented 11 months ago

Sorry for the delay. I took a look at your file: you have ambiguity codes in your reference (the K means G or T), which is very atypical, but I can modify the code to skip this kind of SNP.
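Until then, records like this can be found ahead of time. The sketch below is an assumption-laden illustration, not the change to vcf2phylip itself: `has_ambiguous_allele` is a hypothetical helper, the table holds the standard IUPAC two- and three-base ambiguity codes, and N is deliberately left out because, as far as I know, the VCF specification permits N in REF.

```python
# Standard IUPAC ambiguity codes and the bases they stand for.
# N is omitted on purpose (assumption: N is a legal REF base in VCF).
IUPAC_AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def has_ambiguous_allele(vcf_line):
    """Return True if REF or any ALT allele contains an ambiguity code.

    Hypothetical helper for illustration; expects one tab-delimited data line.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    # Only plain sequence alleles are checked; symbolic alleles such as
    # <NON_REF>, "*" and "." are not alphabetic and are skipped.
    alleles = [a for a in [ref] + alt.split(",") if a.isalpha()]
    return any(
        base.upper() in IUPAC_AMBIGUOUS for allele in alleles for base in allele
    )


# Made-up records: the first has K as the reference base.
print(has_ambiguous_allele("chr1\t10\t.\tK\tG\t50\t.\t.\tGT\t0/1\n"))
print(has_ambiguous_allele("chr1\t20\t.\tA\tG\t50\t.\t.\tGT\t0/1\n"))
```

Lines for which the function returns True could be written to a separate file or dropped before conversion, which sidesteps the KeyError without touching the converter.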

The bigger problem I see now is that you have `<NON_REF>` flags, which shouldn't be there in a standard VCF. Check this discussion: https://gatk.broadinstitute.org/hc/en-us/community/posts/360057940352-Delete-NON-REF-from-VCF

I would recommend removing those flags, or maybe switching to a more standard genotyper such as FreeBayes?
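To see how widespread the flag is before deciding which route to take, the records carrying it can be counted. This is a rough sketch only: it assumes the flag in question is the symbolic allele `<NON_REF>` in the ALT column (as the title of the linked discussion suggests), and `count_non_ref_records` is a hypothetical name. It reports the records but does not edit them, because deleting an ALT allele safely also means adjusting per-sample fields.

```python
def count_non_ref_records(vcf_lines):
    """Return (total, flagged): data records, and those with <NON_REF> in ALT.

    Hypothetical helper for illustration; expects tab-delimited VCF lines.
    """
    total = flagged = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line without an ALT column
        total += 1
        # ALT is a comma-separated list of alleles
        if "<NON_REF>" in fields[4].split(","):
            flagged += 1
    return total, flagged


# Made-up example records: two of the three carry <NON_REF>.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\n",
    "chr1\t10\t.\tA\tG,<NON_REF>\t50\t.\t.\tGT\t0/1\n",
    "chr1\t20\t.\tC\t<NON_REF>\t.\t.\t.\tGT\t0/0\n",
    "chr1\t30\t.\tG\tT\t50\t.\t.\tGT\t1/1\n",
]
total, flagged = count_non_ref_records(example)
print(f"{flagged} of {total} records have <NON_REF> in ALT")
```

If most records are flagged, the file is probably still an intermediate per-sample GVCF rather than a final call set, and fixing the genotyping step is likely cleaner than stripping the allele afterwards.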