`rand_mutate` cannot generate consecutive insertions

Daniel-Liu-c0deb0t / block-aligner

SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.

https://crates.io/crates/block_aligner

MIT License

124 stars 7 forks source link

`rand_mutate` cannot generate consecutive insertions #9

Closed RagnarGrootKoerkamp closed 2 years ago

RagnarGrootKoerkamp commented 2 years ago

It seems that rand_mutate is slightly limited in the kinds of mutations it can generate, since it decides up front whether to generate a match/substitution/insertion/deletion for each position. This way, it can never insert two or more consecutive characters.

See this line: For each generated insertion, it directly pushes the relevant character from a after pushing the inserted character.

Something like an insertion followed by a substitution also can't be generated in this model.

Probably this doesn't matter too much in practice, but it's a slight deviation from the common model of generating the mutations one by one, as done in e.g. wfa.

Daniel-Liu-c0deb0t commented 2 years ago

Good point, this is an oversight on my part. I think it does matter in practice because chains of insertions and deletions makes finding the optimal alignment slightly more difficult for block aligner.

RagnarGrootKoerkamp commented 2 years ago

One potential benefit here is that the actual edit distance of the generated sequence is probably a bit higher than in the completely uniform model. At least my intuition is that in your code the probability of mutations 'cancelling' each other is smaller.

Anyway, I'll likely write a linear time rust implementation for this soon.

RagnarGrootKoerkamp commented 2 years ago

I've just pushed an implementation using ropes here, in case you'd like to reuse it. I'll likely package it separately at some point.

Daniel-Liu-c0deb0t commented 2 years ago

Fixed the implementation of the rand_mutate function by using a slightly different approach than before.