edgardomortiz / vcf2phylip

Convert SNPs in VCF format to PHYLIP, NEXUS, binary NEXUS, or FASTA alignments for phylogenetic analysis
GNU General Public License v3.0

genotypes excluded even if missing data is less than -m 4 #14

Closed a2g1n closed 4 years ago

a2g1n commented 5 years ago

Hi, I have a multi-sample VCF that I filtered to retain reference and SNP calls ONLY if at least 25 of the 31 samples have non-missing data. However, when I convert it to PHYLIP using your script, genotypes are still being excluded when they should not be. Or am I misunderstanding the -m parameter? I also tried -m 0 and still face the same problem. Is there any way to see the excluded genotypes to troubleshoot this?

vcf2phylip.py -i RMPs.vcf -m 4
Total of genotypes processed: 6372167
Genotypes excluded because they exceeded the amount of missing data allowed: 875118
Genotypes that passed missing data filter but were excluded for not being SNPs: 0
SNPs that passed the filters: 5497049

vcf2phylip.py -i RMPs.vcf -m 0
Total of genotypes processed: 6372167
Genotypes excluded because they exceeded the amount of missing data allowed: 810339
Genotypes that passed missing data filter but were excluded for not being SNPs: 0
SNPs that passed the filters: 5561828

Thanks.

edgardomortiz commented 5 years ago

Hello, My guess is that those excluded are deletions. If possible email me a few thousand lines (~10K) of your VCF to be sure or to fix the issue.

Edgardo

a2g1n commented 5 years ago

Hi Edgardo, thanks for your mail. I had filtered my variants to keep only SNPs, so I doubt it's deletions. Anyway, I have attached part of the VCF. Thanks! Also, is it possible to get the VCF lines that the script excluded (or included)? It would be helpful for downstream analysis. For example, I would like to tell how many genes are covered in the RAxML phylogenetic tree, but I can't estimate that without knowing which VCF lines were included.

Regards Abhinay
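
One way to approach the question above (which VCF lines end up kept or excluded) is to count, outside the script, how many samples have a called genotype on each line. The following is a minimal sketch and not vcf2phylip's own code. It assumes -m is read as the minimum number of samples that must have data at a locus (check the script's help text), and the names `count_called`, `partition_sites`, and `MISSING` are hypothetical helpers introduced only for illustration.

```python
# Hypothetical helper, independent of vcf2phylip: split VCF data lines by
# how many samples carry a non-missing genotype (GT).

# Genotype strings treated as fully missing (an assumption of this sketch).
MISSING = {".", "./.", ".|."}


def count_called(fields):
    """Return the number of samples with a non-missing GT on one data line.

    `fields` is a VCF data line already split on tabs; column 9 (index 8)
    is FORMAT and the sample columns start at index 9.
    """
    fmt_keys = fields[8].split(":")
    if "GT" not in fmt_keys:
        return 0
    gt_idx = fmt_keys.index("GT")
    called = 0
    for sample in fields[9:]:
        parts = sample.split(":")
        gt = parts[gt_idx] if gt_idx < len(parts) else "."
        if gt not in MISSING:
            called += 1
    return called


def partition_sites(lines, min_samples):
    """Split VCF data lines into (kept, excluded) lists.

    A line is kept when at least `min_samples` samples have a called GT.
    Header lines (starting with '#') and blank lines are skipped.
    """
    kept, excluded = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if count_called(fields) >= min_samples:
            kept.append(line)
        else:
            excluded.append(line)
    return kept, excluded


if __name__ == "__main__":
    demo = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\n",
        "chr1\t10\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:5\t./.:0\t1/1:7\n",
        "chr1\t20\t.\tC\tT\t.\tPASS\t.\tGT\t./.\t.\t0/0\n",
    ]
    kept, excluded = partition_sites(demo, min_samples=2)
    print(len(kept), "kept;", len(excluded), "excluded")
```

Comparing the size of `excluded` with the count the script reports for the same threshold would show whether the script is dropping more sites than the missing-data rule alone explains. For a real file, the lines could be read with `open()` or `gzip.open(path, "rt")` and passed in the same way.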

edgardomortiz commented 5 years ago

Hi again, I don't see any attachment...

a2g1n commented 5 years ago

Now? test.vcf.gz

a2g1n commented 5 years ago

Hi again @edgardomortiz I just tried with a nonsensical value of -m -1000 and it does not filter out any SNPs. So it is definitely dependent on -m I think?

ke-crawford commented 5 years ago

I had this problem too and it was solved by including a nonsensical value of -m.

A side question: is it possible to get it to include MNPs? I'd like to keep these in my fasta.

edgardomortiz commented 5 years ago

> Hi again @edgardomortiz I just tried with a nonsensical value of -m -1000 and it does not filter out any SNPs. So it is definitely dependent on -m I think?

Hi, I think I fixed the bug, thanks for finding it. Could you re-clone the repository and re-run the script on your files to see if it behaves correctly now?

Edgardo

edgardomortiz commented 5 years ago

> I had this problem too and it was solved by including a nonsensical value of -m.
>
> A side question: is it possible to get it to include MNPs? I'd like to keep these in my fasta.

@ke-crawford the problem with MNPs is that, even though they are usually the same length across samples (in which case you can assume they are aligned), there are also cases where they come unaligned or have different lengths (for example, I have seen that many times in freebayes output). The solution in this case is to normalize the allele variant representation with something like vcfallelicprimitives, see: https://github.com/ekg/freebayes#normalizing-variant-representation. In other words, convert all MNPs to SNPs.

Edgardo
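
To illustrate what "convert all MNPs to SNPs" means in the simplest case, here is a small sketch that decomposes an MNP whose REF and ALT alleles have the same length into one SNP per differing base. This is not how vcfallelicprimitives is implemented; `split_mnp` is a hypothetical function, it handles a single ALT allele only, and it deliberately refuses alleles of different lengths, which are the cases that need a real normalization tool.

```python
def split_mnp(pos, ref, alt):
    """Decompose an equal-length MNP into per-base SNPs.

    Returns a list of (position, ref_base, alt_base) tuples, one for each
    position where REF and ALT differ. `pos` is the 1-based position of
    the first base of REF, as in a VCF POS column.
    """
    if len(ref) != len(alt):
        # Unequal lengths imply an indel or an unaligned representation;
        # this sketch does not attempt to realign them.
        raise ValueError("REF and ALT differ in length; normalize first")
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


if __name__ == "__main__":
    # The middle base is identical in REF and ALT, so only two SNPs result.
    print(split_mnp(100, "ACG", "TCA"))  # [(100, 'A', 'T'), (102, 'G', 'A')]
```

The sketch only rewrites the alleles; carrying each sample's genotype over to the resulting SNP records is left to the normalization tool.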

edgardomortiz commented 4 years ago

Closing the issue. @ke-crawford, feel free to re-open.