Closed abcosta closed 6 years ago
Hello,
Are those variants made of a single nucleotide? sometimes the ALT field in the vcf is a short sequence representing an insertion, and the script skips those, (see the 3rd and last variant in the image attached)
That is all I can guess without having the actual VCF to examine it, if you want you can send it as well so I can take a look.
Edgardo
On Sun, Sep 9, 2018, 01:24 abcosta notifications@github.com wrote:
Hi,
I'm using your python script to convert my vcf file to nexus and phylip formats. However, I realized that there is a discrepancy in the number of sites between the input and output files. I am currently using a VCF file with no missing data and 681 variant sites, and both output files (nexus and phylip) have a total of 647 sites. The same happens if I use a vcf file with more sites (including missing data). Is there an explanation for such difference?
Thank you.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/edgardomortiz/vcf2phylip/issues/5, or mute the thread https://github.com/notifications/unsubscribe-auth/AFgBvKq58YA2lSzZaQA9OCZawd5rTbsYks5uZFG1gaJpZM4WgEdr .
Hi,
I think you are right. I couldn't find the image attached (mentioned in your message) but I checked my vcf file and I found 34 sites (including 6 that only appear in the REF) with more than one nucleotide. Removing these 34 sites from the original 681, the number of sites decrease to 647.
Thank you for your help!
I sent it as an attachment from gmail, sorry, this was the image:
Glad to help
Dear Dr. Ortiz,
I have another question regarding the format of the phylip/nexus matrix. I'm trying to run RAxML with ascertainment bias correction (ASC) and I'm getting an error message* saying that my file has invariant sites, as the one below:
ARRRARRARRRAAAARARRRARARAAARAARARRAARRRRAAAAARRRRRRAAARRRRARRARRARRRR (position highlighted in the image attached - POS 18816)
*RAxML will consider positions like this as invariant due to the presence of an ambiguous base (R can also be considered as A according to the RAxML manual).
I read in the vcf2phylip github that "For heterozygous SNPs the consensus is made and the IUPAC nucleotide ambiguity codes are written to the final matrix." So my questions regarding this subject are:
(1) First, I'd like to make sure I understood what is happening when converting the vcf file to phylip format: when a position for an individual has 0/1 (Ref/alt alleles), the script will convert it for one of the ambiguity codes, right? (as seems to be the case in the figure attached)
(2) Second, I'd like to ask your opinion: Is there any other alternative to solve this ambiguity so then I don't need to delete all the sites indicated by RAxML? Or do you think it is better to delete the sites to be able to run RAxML with ASC?
Thank you for all your help! Have a great week, Ana
[image: Screen Shot 2018-09-11 at 10.06.17 AM.png]
On Mon, Sep 10, 2018 at 5:58 AM Edgardo M. Ortiz notifications@github.com wrote:
I sent it as an attachment from gmail, sorry, this was the image: [image: vcfheaders] https://user-images.githubusercontent.com/5767612/45293657-195a3a00-b4f9-11e8-9835-3d12b0038e67.png
Glad to help
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/edgardomortiz/vcf2phylip/issues/5#issuecomment-419873345, or mute the thread https://github.com/notifications/unsubscribe-auth/AogvUtjPfZnT6q0EnaZbuKfp6l_0QSlVks5uZkXBgaJpZM4WgEdr .
Dear Ana,
The answer to question 1 is yes, it will write the ambiguity code. For question 2: from the point of view of phylogenetic software that is an semi-invariable column since R=AG (see the explanation here: http://www.iqtree.org/doc/Frequently-Asked-Questions "Why does IQ-TREE complain about the use of +ASC model?”), the only way to make that column variable would be to turn the Rs into Gs, but that doesn’t make much sense to me, that column is simply not appropriate for the ASC model in RAXML or IQTREE.
That being said, IQTREE will take your file with these columns and produce a file that only contains variable columns on which you can run the ASC model if you want (with RAXML or IQTREE). Another option is to run other phylogenetic software on your data such as svdquartets or tetrad (from ipyrad) that handle SNPs better.
I hope this helps
PS: I didn’t get your attachment either, is better to write directly to my gmail account (e.ortiz.v@gmail.com) for this sort of things
Edgardo
Hi,
Thank you so much! That was really helpful. I had removed the invariant sites and tried both RAxML and IQTree (I really like IQTree). I might take a look into the SVDQuartets
In case of any future question with an attached file I will send to your email.
All the best, Ana
On Sep 12, 2018, at 2:56 PM, Edgardo M. Ortiz notifications@github.com wrote:
https://github.com/Cibiv/IQ-TREE/wiki/Frequently-Asked-Questions https://github.com/Cibiv/IQ-TREE/wiki/Frequently-Asked-Questions
Hi Edgardo,
I have a similar situation here. When converting my vcf file to nexus I loose some variants. But it says that all SNPs have passed the filters. Is there a way to find out what variants were removed?
Number of samples in VCF: 14 Total of genotypes processed: 50000 Genotypes excluded because they exceeded the amount of missing data allowed: 0 Genotypes that passed missing data filter but were excluded for not being SNPs: 0 SNPs that passed the filters: 50000 Biallelic SNPs selected for binary NEXUS:49422
As far as I know, I don't have any insertions...
I would guess that you had some with more than two alleles, you can check how many you have in your VCF with:
awk '$5 ~ /,/ {print}' file.vcf | wc -l
In your file I found 578 (50000-49422)
Also, you are getting 50000 SNPs in your PHYLIP, and regular NEXUS files. Those 578 non-biallelic SNPs are excluded only when creating the binary NEXUS for SNAPP.
sorry I forgot to reply. Yes, that was the issue.
Hi,
I'm using your python script to convert my vcf file to nexus and phylip formats. However, I realized that there is a discrepancy in the number of sites between the input and output files. I am currently using a VCF file with no missing data and 681 variant sites, and both output files (nexus and phylip) have a total of 647 sites. The same happens if I use a vcf file with more sites (including missing data). Is there an explanation for such difference?
Thank you.