JosephCrispell / homoplasyFinder

A tool to identify and annotate homoplasies on a phylogeny and sequence alignment
GNU General Public License v3.0
19 stars 3 forks source link

Output inconsistent sites as VCF file #7

Open JosephCrispell opened 4 years ago

JosephCrispell commented 4 years ago

An important enhancement pointed out by flashton2003. Thanks!

Include option in homoplasyFinder to export the consistencyIndexReport_DATE.txt file in a VCF file format.

The VCF file can then easily be run with bcftools csq to look at the potential consequences of the changes at the inconsistent sites.

Ideally this would require a reference genome file to be provided so the REF column is accurately filled out. It might be possible to avoid this if I assume the allele with the largest count if the reference/ancestral allele.

flashton2003 commented 4 years ago

Hey Joe, here is my script for converting the homoplasyFinder output to VCF. It assumes that the input to homoplasy finder was generated by snp-sites and the script requires the VCF formatted snp-sites output to convert the position in the snp-sites output to the genomic position.

https://github.com/flashton2003/homoplasy_finder_scripts/blob/master/convert_to_vcf.py

One behaviour which is maybe unexpected is that if there are multiple var alleles at the same position it outputs them on different lines. In my brief experiments bcftools csq couldn't interpret mutli-allelic (>2) vars, hence this hacky workaround. Because of this (and other reasons, mainly that no single genome has all the vars in the HF, so doing a joint interpretation is not valid) you have to use bcftools csq with the -l option so that each var is interpreted by itself, otherwise it falls over because of overlapping variants (IIRC)

Might be useful to someone? I've only really use it once so definitely worth sanity checking the outputs.

JosephCrispell commented 4 years ago

Thanks Phil. That is some nice looking code - beautifully written! I will give it a go and see if it is as good as it looks!

I am planning on writing a similar function in R, this will be extremely helpful - as long as it is ok for me to base my code on?

JosephCrispell commented 4 years ago

Write as separate function in R package. Takes in reference genome as parameter, as well as vector of adjusted positions of original fasta (optional). Where multiple non-reference alleles present - split over multiple lines in VCF (as Phil says above).