FePhyFoFum / phyx

phylogenetics tools for linux (and other mostly posix compliant) computers
blackrim.org
GNU General Public License v3.0

Removal of hypervariable sequences #187

Open BenKuhnhaeuser opened 3 hours ago

BenKuhnhaeuser commented 3 hours ago

Hi there, I love how the phyx family is expanding to include so many useful capabilities. I would like to suggest a program that removes highly divergent / hypervariable sequences from multiple sequence alignments. Such highly divergent sequences can point to various issues, such as chimeric or paralogous sequences being mixed up in an alignment. The tool would calculate the number of sites divergent from the consensus sequence (similar to calculating missing sites), and remove sequences above the specified threshold.
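
To make the whole-sequence idea concrete, here is a minimal Python sketch of one possible reading of it. It is not phyx code: the function names, the majority-rule consensus, the treatment of `-`, `?` and `N` as missing data (which assumes nucleotide alignments), and the use of a proportion rather than a raw count are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: nucleotide data, where '-', '?' and 'N' carry no information.
MISSING = set("-?N")

def consensus(seqs):
    """Majority-rule consensus of an alignment given as {name: sequence}.

    Columns containing only missing data get '-'. Ties go to the state
    seen first in the column.
    """
    cons = []
    for column in zip(*seqs.values()):
        counts = Counter(c.upper() for c in column if c.upper() not in MISSING)
        cons.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(cons)

def divergence(seq, cons):
    """Proportion of comparable sites where seq differs from the consensus."""
    compared = mismatched = 0
    for a, b in zip(seq.upper(), cons.upper()):
        if a in MISSING or b in MISSING:
            continue  # skip sites that cannot be compared
        compared += 1
        mismatched += a != b
    return mismatched / compared if compared else 0.0

def filter_divergent(seqs, threshold):
    """Keep only sequences whose divergence does not exceed the threshold."""
    cons = consensus(seqs)
    return {name: s for name, s in seqs.items()
            if divergence(s, cons) <= threshold}
```

Scoring by proportion of comparable sites, instead of a raw count, keeps short or gappy sequences from looking artificially similar to the consensus. Whether that is the right definition is exactly the kind of choice that would need to be settled.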

A "sliding window" option would be even cooler to clean short stretches of highly divergent sequences in otherwise perfectly aligned and well-behaved sequences without discarding the entire sequence.
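
As a rough sketch of what the sliding-window variant might look like, the Python function below masks only the divergent stretches instead of dropping the sequence. Everything here is hypothetical: the function name, the default window size and threshold, masking with `-`, and the assumption that a consensus string of the same length is already available.

```python
# Assumption: nucleotide data, where '-', '?' and 'N' carry no information.
MISSING = set("-?N")

def mask_divergent_windows(seq, cons, window=10, threshold=0.5, mask_char="-"):
    """Mask every window of seq whose mismatch proportion against cons
    exceeds the threshold. Windows advance one site at a time."""
    if len(seq) != len(cons):
        raise ValueError("sequence and consensus must have equal length")
    flagged = [False] * len(seq)
    # At least one (possibly shortened) window, even for short sequences.
    for start in range(max(len(seq) - window + 1, 1)):
        end = min(start + window, len(seq))
        compared = mismatched = 0
        for a, b in zip(seq[start:end].upper(), cons[start:end].upper()):
            if a in MISSING or b in MISSING:
                continue  # skip sites that cannot be compared
            compared += 1
            mismatched += a != b
        if compared and mismatched / compared > threshold:
            for i in range(start, end):
                flagged[i] = True
    return "".join(mask_char if f else c for c, f in zip(seq, flagged))
```

Because whole windows are masked, a few matching sites flanking a divergent stretch can be masked along with it. A smaller window or a stricter threshold reduces that overshoot at the cost of sensitivity.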

Many thanks for the consideration! Ben

josephwb commented 2 hours ago

This does indeed sound useful. The first thing we would need is a definition (or definitions) of "highly divergent / hypervariable". I can think of various ways to pick the "most different" sequence(s), but what would the threshold be? If you have concrete ideas about what you'd like to see, then please share!