edgardomortiz / vcf2phylip

Convert SNPs in VCF format to PHYLIP, NEXUS, binary NEXUS, or FASTA alignments for phylogenetic analysis
GNU General Public License v3.0

vcf file (monomorphic SNPs discarded) to phylip #9

Closed Carol-Symbiomics closed 5 years ago

Carol-Symbiomics commented 5 years ago

Hi! I'm interested in using RAxML to assess how my samples cluster based on RADseq population SNPs. I converted my VCF file to PHYLIP format using your Python code. I'm completely sure I don't have monomorphic SNPs in my data set. However, when I run RAxML with the ASC correction option (recommended when using only variable sites, i.e. SNPs), the program displays an error: "For partition No Name Provided you specified that the likelihood score shall be corrected for invariant sites via an ascertainment bias correction. However, some sites in this partition are already invariant. This is not allowed, please remove all invariant sites and try again, exiting". How is this possible? Can someone help me?

edgardomortiz commented 5 years ago

Hi there, I think it has to do with the particular states in those columns: if a column has only Rs and As, for example, it is considered invariable because R means A or G, so every sample in that column could carry an A. See a better explanation in the IQ-TREE FAQs (the very last question): http://www.iqtree.org/doc/Frequently-Asked-Questions

IQ-TREE will filter out those invariable columns if you try to run an analysis with the +ASC model and your matrix still has this kind of invariable site. It will automatically create a new matrix with the extension .varsites.phy, which can be analyzed with +ASC either in IQ-TREE itself or in RAxML.
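To make the rule above concrete, here is a minimal Python sketch of how such columns could be flagged and dropped. It is not the code used by IQ-TREE or vcf2phylip; the function names are hypothetical, and it assumes a column counts as invariable whenever at least one base is compatible with every character in it, with anything outside the listed IUPAC codes (N, -, ?) treated as "any base".

```python
# Bases represented by each IUPAC nucleotide code.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
}
ALL_BASES = {"A", "C", "G", "T"}


def possible_bases(char):
    """Return the set of bases a character may stand for.

    Assumption: anything not in the table (N, -, ?) is fully ambiguous.
    """
    return IUPAC.get(char.upper(), ALL_BASES)


def is_potentially_invariant(column):
    """True if a single base is compatible with every character in the column."""
    shared = set(ALL_BASES)
    for char in column:
        shared &= possible_bases(char)
        if not shared:
            return False
    return True


def drop_potentially_invariant(seqs):
    """Remove potentially invariant columns from a {name: sequence} alignment."""
    names = list(seqs)
    columns = zip(*(seqs[name] for name in names))
    kept = [col for col in columns if not is_potentially_invariant(col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


if __name__ == "__main__":
    alignment = {"s1": "ARA", "s2": "AAC", "s3": "RGT"}
    # The first column (A, A, R) can be resolved to all A, so it is dropped.
    print(drop_potentially_invariant(alignment))
```

In the toy alignment, the first column contains only A and R and is removed, while the other two columns cannot be resolved to a single base and are kept.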

Edgardo

Carol-Symbiomics commented 5 years ago

Hi Edgardo,

Thanks for your quick reply. I'm a new user of RAxML, but I was examining the converted VCF file and noticed some "weird" characters (e.g. K, W, R, N, S, Y, M). Is that normal? Shouldn't I have all my SNPs concatenated? My SNPs are biallelic, so will they be changed to an R if they are either A or G?

With your Python code, is there a way to only concatenate the SNPs in the VCF, avoiding this IQ-TREE nomenclature http://www.iqtree.org/doc/ ?

Thanks in advance for your help

edgardomortiz commented 5 years ago

All the genotypes in your VCF are transformed to their IUPAC ambiguity codes because the output matrices have a single sequence per sample. It is normal to have ambiguity codes in the matrices. They are routinely analyzed in phylogenetics, but they can violate the conditions of the ascertainment model.

Or maybe I don't understand your question very well: "Shouldn't I have all my SNPs concatenated?" Would you mind clarifying? What kind of output were you expecting?
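As an illustration of the transformation described above, here is a minimal Python sketch that collapses one diploid genotype into a single character. It is not vcf2phylip's actual implementation; the function name and its arguments are hypothetical, and it assumes single-nucleotide alleles and returns N for missing calls or anything it cannot encode.

```python
# Map each set of observed bases to its single-letter IUPAC code.
AMBIGUITY = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_iupac(ref, alts, gt):
    """Collapse a diploid genotype into one IUPAC character.

    ref  -- REF allele, e.g. "A"
    alts -- list of ALT alleles, e.g. ["G"]
    gt   -- GT field, e.g. "0/1", "1|1" or "./."
    """
    alleles = [ref] + list(alts)
    indices = gt.replace("|", "/").split("/")
    if "." in indices:
        return "N"  # missing call
    bases = frozenset(alleles[int(i)] for i in indices)
    # Assumption: indels or other unexpected alleles are encoded as N.
    return AMBIGUITY.get(bases, "N")


if __name__ == "__main__":
    for gt in ("0/0", "0/1", "1|1", "./."):
        print(gt, genotype_to_iupac("A", ["G"], gt))
```

For a biallelic A/G SNP, homozygous samples keep A or G and only heterozygous samples become R, which is why a column can end up containing nothing but As and Rs.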

Edgardo

Carol-Symbiomics commented 5 years ago

You understood correctly. I thought there was a way to concatenate the SNPs into a FASTA file without using the IUPAC ambiguity codes. It looks to me like I will have to use IQ-TREE to filter out the "non-variant" sites, because RAxML doesn't have that option. Thanks for your time!

edgardomortiz commented 5 years ago

No problem,

Also, you have some alternatives for your analysis, since you are interested in the clustering and not so much in the branch lengths, where the ascertainment correction becomes more relevant. You could use your SNP matrix (in NEXUS format) with SVDquartets, which is now part of PAUP*, or simply analyze it in IQ-TREE (or RAxML) without the ascertainment correction.

Edgardo