TimoLassmann / kalign

A fast multiple sequence alignment program.
GNU General Public License v3.0

Feature requests: exporting distance matrix & medoid sequence #15

Closed andreaswallberg closed 4 years ago

andreaswallberg commented 4 years ago

Dear @TimoLassmann ,

Thanks for this tool!

I am currently using it to align very large sets of pre-clustered "families" of RNA transcripts in the absence of a reference genome to map them to. The performance is stellar: I aligned 36,000 reads in 8 minutes. Amazing!

My downstream application would benefit from being able to export the pairwise distance matrix and the guide tree and, if possible, also the "most representative" sequence, the so-called medoid, which has the smallest combined pairwise distance to all other sequences in the set. All of these pieces of information are presumably present inside the running application.

I wonder if you could consider implementing methods to export them (i.e. a simple "seq_A seq_B distance" table, a NEXUS/Newick tree, and the name of the sequence with the overall shortest distance to all others). This would be much appreciated.
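To make the request concrete, here is a minimal sketch of the kind of output I mean, computed outside of kalign from an alignment it has already produced. Everything in it is an assumption for illustration: the function names (`p_distance`, `pairwise_table`, `medoid`) are hypothetical, the toy sequences are made up, and the distance is a plain p-distance (fraction of mismatching columns, ignoring columns with a gap), which is presumably not the heuristic distance kalign uses internally for its guide tree.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of mismatching columns between two aligned rows.

    Columns where either row has a gap are skipped (an assumption;
    other gap treatments are equally defensible).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def pairwise_table(aligned):
    """Return (name_1, name_2, distance) for every unordered pair."""
    return [
        (m, n, p_distance(aligned[m], aligned[n]))
        for m, n in combinations(aligned, 2)
    ]


def medoid(aligned, table):
    """Name of the sequence with the smallest summed distance to all others."""
    totals = {name: 0.0 for name in aligned}
    for m, n, d in table:
        totals[m] += d
        totals[n] += d
    return min(totals, key=totals.get)


# Toy aligned sequences (made up); in practice these would be read
# from the alignment file written by the aligner.
aligned = {
    "seq_A": "ACGT-ACGT",
    "seq_B": "ACGTTACGA",
    "seq_C": "ACCTTACGA",
}

table = pairwise_table(aligned)
for m, n, d in table:
    print(f"{m}\t{n}\t{d:.3f}")  # the "seq_A seq_B distance" table

print("medoid:", medoid(aligned, table))
```

Note that this is quadratic in the number of sequences, so for tens of thousands of reads it would be slow in pure Python, which is part of why an export from the aligner itself would be attractive.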

Cheers!

TimoLassmann commented 4 years ago

Hi Andreas,

Older versions of kalign did have the option to export the tree. However, in this version I decided to leave it out, since the guide tree is built using many heuristics. That is fine for alignment, but I would recommend using a purpose-built package like MrBayes to get a high-quality phylogenetic tree from the alignment.

Cheers, Timo