Closed RagnarGrootKoerkamp closed 2 years ago
Good point, this is an oversight on my part. I think it does matter in practice, because chains of insertions and deletions make finding the optimal alignment slightly more difficult for block aligner.
One potential benefit here is that the actual edit distance of the generated sequence is probably a bit higher than in the completely uniform model. At least my intuition is that in your code the probability of mutations 'cancelling' each other is smaller.
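To make the 'cancelling' point concrete, here is a minimal Rust sketch. The `Mutation` enum and the `apply` and `edit_distance` helpers are hypothetical names made up for this illustration and are not taken from either codebase. It shows that applying k mutations one after another only gives an upper bound of k on the edit distance, since a later mutation can undo an earlier one.

```rust
/// A single edit on a byte sequence (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy)]
enum Mutation {
    Substitute(usize, u8),
    Insert(usize, u8),
    Delete(usize),
}

/// Apply one mutation in place. Panics if the index is out of bounds.
fn apply(seq: &mut Vec<u8>, m: Mutation) {
    match m {
        Mutation::Substitute(i, c) => seq[i] = c,
        Mutation::Insert(i, c) => seq.insert(i, c),
        Mutation::Delete(i) => {
            seq.remove(i);
        }
    }
}

/// Plain O(|a| * |b|) unit-cost edit distance, keeping only two DP rows.
fn edit_distance(a: &[u8], b: &[u8]) -> usize {
    // prev[j] = distance between the first i characters of a and the first j of b.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let sub = prev[j] + usize::from(ca != cb); // match or substitution
            let del = prev[j + 1] + 1; // delete ca from a
            let ins = cur[j] + 1; // insert cb into a
            cur.push(sub.min(del).min(ins));
        }
        prev = cur;
    }
    prev[b.len()]
}
```

For example, starting from `ACGT`, `Insert(2, b'T')` followed by `Delete(2)` spends two mutations but gives back `ACGT`, so the edit distance is 0 rather than 2.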
Anyway, I'll likely write a linear-time Rust implementation for this soon.
I've just pushed an implementation using ropes here, in case you'd like to reuse it. I'll likely package it separately at some point.
Fixed the implementation of the `rand_mutate` function by using a slightly different approach than before.
It seems that `rand_mutate` is slightly limited in the kinds of mutations it can generate, since it decides up front whether to generate a match/substitution/insertion/deletion for each position. This way, it can never insert two or more consecutive characters. See this line: for each generated insertion, it directly pushes the relevant character from `a` after pushing the inserted character. Something like an insertion followed by a substitution also can't be generated in this model.
Probably this doesn't matter too much in practice, but it's a slight deviation from the common model of generating the mutations one by one, as done in e.g. wfa.
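For reference, a minimal sketch of the one-by-one model could look like the following. It is an illustration under stated assumptions, not the code from wfa or from this repository: `XorShift64` is a tiny stand-in PRNG so the example needs no external crate, and `mutate_sequentially` and the `ACGT` alphabet are made-up choices. Because every mutation is applied to the current sequence rather than decided per position of the original, consecutive insertions and an insertion followed by a substitution can both occur.

```rust
/// Tiny xorshift PRNG, used only to keep this sketch dependency-free
/// (an assumption; a real implementation would likely use the `rand` crate).
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Pseudo-random index in 0..n (slightly biased, fine for a sketch).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Apply `k` mutations one by one, each at a random position of the
/// *current* sequence. Hypothetical helper, not the repository's `rand_mutate`.
fn mutate_sequentially(seq: &[u8], k: usize, rng: &mut XorShift64) -> Vec<u8> {
    const ALPHABET: &[u8] = b"ACGT";
    let mut out = seq.to_vec();
    for _ in 0..k {
        let c = ALPHABET[rng.below(ALPHABET.len())];
        // Deletions and substitutions need a non-empty sequence,
        // so fall back to an insertion when nothing is left.
        let kind = if out.is_empty() { 0 } else { rng.below(3) };
        match kind {
            0 => {
                // Insertion: any of the len + 1 gaps, including right
                // next to a previously inserted character.
                let i = rng.below(out.len() + 1);
                out.insert(i, c);
            }
            1 => {
                // Deletion of a random character.
                let i = rng.below(out.len());
                out.remove(i);
            }
            _ => {
                // Substitution; may pick the same character, i.e. a no-op.
                let i = rng.below(out.len());
                out[i] = c;
            }
        }
    }
    out
}
```

Note that `Vec::insert` and `Vec::remove` are linear in the sequence length, so this sketch takes O(n) time per mutation. It only illustrates the model and makes no attempt to be efficient.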