Closed gorliver closed 2 years ago
@gorliver thanks for asking, very interesting question!
Edlib doesn't support anything like this, nor do I know of algorithms that take anchors into account, but it does sound like a valid use case and I wouldn't be surprised if there are solutions out there for this -> it is just that I haven't been looking into them.
I am guessing that here the key is in defining how you want the algorithm to work in regards to these anchors -> is it ok to have gaps immediately next to anchors, or should those be penalized more than gaps that are further away? Because if it is the same, you could probably just split both query and target strings into multiple substrings at the places where the anchors are and that is it, solve each of those as an individual problem. So if you have AATGAAT
and ATAGAAT
you would split it into (AAT
, ATA
) and (AAT
, AAT
) and solve those as two individual alignments and then stitch those together. But, this feels pretty simple -> if I was you I would be inclined to search for an algorithm that is "smarter" regarding the anchors.
I will close this issue for now but feel free to ask more questions and we can continue the discussion below.
@Martinsos Thank you for your reply. The "split-stitch" sounds a great solution and smart enough for me :-) I will definitely give it a try.
Hi Martin, I wonder do you have any plan to imply an anchored alignment? That means conducting the alignment from several anchor points. I'm asking this because I often face unwanted alignment in repetitive sequences. For example, two sequences:
seq1 AATAATAATGAATAATAAT seq2 AATGAATAATAATAATAAT
The correct alignment is:
AATAATAATGAATAATAAT --------AATGAATAATAATAATAAT
because the "G" mark the correct postion. However, I offen got alignment like this: AAT-AATAATGAATAATAAT AATGAATAAT-AATAATAAT
This example is a simplified one as in my real data, there are mismatches/gaps to make the second alignment has a higher score.
I wonder that is there a way that I can provide a set of "anchor points" so that the final alignments has to match the anchor point pairs in the two sequences?
Many thank, Gorliver