NicolaDM / MAPLE

MAPLE - a new approximate approach for maximum likelihood phylogenetics at short divergence.
GNU General Public License v3.0

Extension for Maple format: including insertions #8

Open corneliusroemer opened 2 years ago

corneliusroemer commented 2 years ago

In order to allow lossless compression, insertions need to be included in the Maple format.

I couldn't find this edge case specified in the preprint.

It's easy to do, one just needs to agree to a convention, e.g.

2134 ins ACGTT

for an insertion of ACGTT after (or before) nucleotide 2134.

Alternative: no need for magic word, one simply includes multiple letters instead of one (I think this would be akin to VCF). If 2134 is usually C, one would write:

2134 CACGTT

for an insertion of ACGTT after nucleotide 2134.

It would be good if you could include treatment of insertions in the preprint.

I think both proposals would work in principle. Both have advantages.

The first is a bit more explicit, the second doesn't require a magic word.
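To make the two conventions concrete, here is a minimal Python sketch of a parser that accepts either one. The function name `parse_insertion` and the whitespace-separated, position-first layout are assumptions taken from the examples above; neither is part of the MAPLE format as currently specified.

```python
def parse_insertion(entry):
    """Return (position, anchor_base, inserted_bases) for one entry line.

    Hypothetical helper: both layouts are proposals from this thread,
    not part of the published MAPLE format.
    """
    fields = entry.split()
    pos = int(fields[0])
    if len(fields) == 3 and fields[1] == "ins":
        # Convention 1: explicit keyword; the base at `pos` is not stated.
        return pos, None, fields[2].upper()
    if len(fields) == 2 and len(fields[1]) > 1:
        # Convention 2 (VCF-like): the first letter is the base at `pos`,
        # the remaining letters are inserted after it.
        seq = fields[1].upper()
        return pos, seq[0], seq[1:]
    raise ValueError(f"not an insertion entry: {entry!r}")


print(parse_insertion("2134 ins ACGTT"))
print(parse_insertion("2134 CACGTT"))
```

One consequence of the second layout is that the anchor letter is carried along with the insertion, so a reader of the file does not need the reference to know which base sits at that position.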

NicolaDM commented 2 years ago

There are a few complications here that need to be ironed out. It is true that the MAPLE format is a lossless compression of an alignment, but only for alignments to a reference (in which insertions are usually removed) and not of a full multiple sequence alignment. Insertions are the pickle: if two sequences both have insertions at the same position of the reference, these two can be aligned on top of each other in the MSA if they descended from the same insertion event, or they can be non-overlapping if they are the result of two different insertion events. In phylogenetic terms, these two scenarios are different. I don't think this is a problem: MSAs are anyway inferred from the data, and one could use a MAPLE format and use appropriate phylogenetic inference to estimate if two insertions are descendants of the same event or not, so in a sense an individual MAPLE file would represent a class of similar MSAs. However, to do this, one would need to include indels in the probabilistic model of sequence evolution in MAPLE - something that I would really love to do, but is not yet at the very top of the to do list.
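The ambiguity can be illustrated with a small Python sketch. The two toy MSAs below are made up for illustration: in `shared` the two insertions are aligned on top of each other (one event), in `separate` they occupy different columns (two events). The helper `insertions_by_sample` is hypothetical; it only records, per sample, which bases were inserted after which reference position.

```python
def insertions_by_sample(msa):
    """Collapse an MSA to reference-anchored insertions per sample.

    `msa` maps names to aligned rows; the "ref" row has '-' in every
    insertion column. Hypothetical helper, for illustration only.
    """
    ref = msa["ref"]
    result = {}
    for name, row in msa.items():
        if name == "ref":
            continue
        pos = 0          # 1-based reference position of the last reference base
        found = {}
        for r, q in zip(ref, row):
            if r != "-":
                pos += 1
            elif q != "-":
                # Insertion column: attach the base to the preceding position.
                found[pos] = found.get(pos, "") + q
        result[name] = found
    return result


# One shared insertion event: the inserted bases are in the same columns.
shared = {"ref": "AC---GT", "s1": "ACTTTGT", "s2": "ACTTTGT"}
# Two independent events: the inserted bases are in different columns.
separate = {"ref": "AC------GT", "s1": "ACTTT---GT", "s2": "AC---TTTGT"}

print(insertions_by_sample(shared) == insertions_by_sample(separate))
```

Both MSAs collapse to the same per-sample description (TTT inserted after position 2 in each sample), which is the sense in which one reference-anchored file would stand for a class of MSAs.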

Regarding the insertion format, indeed the second one you propose is the one I had in mind, since it's convenient and similar to the notation people use in VCFs. Right now I don't have an urgent reason to write scripts that create maple files with insertions: these would anyway be ignored by the phylogenetic inference, and all the SARS-CoV-2 fasta alignments I have are alignments to the reference, given that proper MSAs of millions of genomes would be very hard to build and would be absolutely massive in size. However, I think we can consider in principle this VCF-like insertion format as part of the maple format for future reference. Thinking about how to address this topic in the future, rather than first creating a script that translates a full MSA to a maple file with insertions, I think it would be more practical to instead write a script that translates a collection of pairwise alignments to the reference into a maple file with insertions - the reason again being that I am not sure how practical it is in the first place to create and store a fasta MSA with so many sequences!
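As a rough idea of what the pairwise-alignment route could look like, here is a minimal Python sketch that turns one pairwise alignment to the reference into VCF-like entries. It is not an existing MAPLE script: the function name `entries_from_pairwise`, the `(position, text)` output, and the choice to emit one `-` entry per deleted position (without merging runs) are all assumptions made for illustration.

```python
def entries_from_pairwise(ref_aln, qry_aln):
    """Return [(position, text)] where the query differs from the reference.

    `ref_aln` and `qry_aln` are the two rows of one pairwise alignment;
    '-' in `ref_aln` marks an insertion column. `text` is the query base
    at `position` followed by any bases inserted after it.
    Hypothetical sketch, not part of MAPLE.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    entries = []
    pos = 0                      # 1-based coordinate of the current reference base
    ref_base = qry_base = ""
    inserted = []
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == "-":
            if q != "-":
                if pos == 0:
                    raise ValueError("insertion before the first reference base")
                inserted.append(q)
            continue
        # A new reference column starts: emit the previous one if it differs.
        if pos and (qry_base != ref_base or inserted):
            entries.append((pos, qry_base + "".join(inserted)))
        pos += 1
        ref_base, qry_base, inserted = r, q, []
    # Emit the last reference column if it differs.
    if pos and (qry_base != ref_base or inserted):
        entries.append((pos, qry_base + "".join(inserted)))
    return entries


print(entries_from_pairwise("ACG--T", "ACGTTT"))
print(entries_from_pairwise("AC-GT", "ATAG-"))
```

Because each sequence is processed against the reference on its own, this avoids ever building or storing the full MSA; whether two samples' insertions at the same position share an origin is left open, as discussed above.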