Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
492 stars 162 forks source link

anchored alignment #190

Closed gorliver closed 2 years ago

gorliver commented 2 years ago

Hi Martin, I wonder do you have any plan to imply an anchored alignment? That means conducting the alignment from several anchor points. I'm asking this because I often face unwanted alignment in repetitive sequences. For example, two sequences:

seq1 AATAATAATGAATAATAAT seq2 AATGAATAATAATAATAAT

The correct alignment is:

AATAATAATGAATAATAAT --------AATGAATAATAATAATAAT

because the "G" mark the correct postion. However, I offen got alignment like this: AAT-AATAATGAATAATAAT AATGAATAAT-AATAATAAT

This example is a simplified one as in my real data, there are mismatches/gaps to make the second alignment has a higher score.

I wonder that is there a way that I can provide a set of "anchor points" so that the final alignments has to match the anchor point pairs in the two sequences?

Many thank, Gorliver

Martinsos commented 2 years ago

@gorliver thanks for asking, very interesting question!

Edlib doesn't support anything like this, nor do I know of algorithms that take anchors into account, but it does sound like a valid use case and I wouldn't be surprised if there are solutions out there for this -> it is just that I haven't been looking into them.

I am guessing that here the key is in defining how you want the algorithm to work in regards to these anchors -> is it ok to have gaps immediately next to anchors, or should those be penalized more than gaps that are further away? Because if it is the same, you could probably just split both query and target strings into multiple substrings at the places where the anchors are and that is it, solve each of those as an individual problem. So if you have AATGAAT and ATAGAAT you would split it into (AAT, ATA) and (AAT, AAT) and solve those as two individual alignments and then stitch those together. But, this feels pretty simple -> if I was you I would be inclined to search for an algorithm that is "smarter" regarding the anchors.

I will close this issue for now but feel free to ask more questions and we can continue the discussion below.

gorliver commented 2 years ago

@Martinsos Thank you for your reply. The "split-stitch" sounds a great solution and smart enough for me :-) I will definitely give it a try.