Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Better explain start and end locations and possible alignments #93

Open Martinsos opened 7 years ago

Martinsos commented 7 years ago

In general, there are usually many different alignments with the same score. Some of them have different start and end locations, but two alignments can even share the same start location, end location, and score and still be different alignments.
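To make the point about co-optimal alignments concrete, here is a minimal sketch that counts how many distinct optimal global alignments exist for a pair of strings. It is a plain dynamic-programming illustration, not how edlib is implemented, and the function name `count_optimal_alignments` is made up for this example.

```python
def count_optimal_alignments(query, target):
    """Return (edit_distance, number_of_distinct_optimal_global_alignments).

    Hypothetical helper for illustration only; it is not part of edlib.
    """
    n, m = len(query), len(target)
    # dist[i][j]: edit distance between query[:i] and target[:j]
    # ways[i][j]: number of distinct alignments that achieve dist[i][j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    ways = [[0] * (m + 1) for _ in range(n + 1)]
    ways[0][0] = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:  # query character aligned to a gap
                candidates.append((dist[i - 1][j] + 1, ways[i - 1][j]))
            if j > 0:  # target character aligned to a gap
                candidates.append((dist[i][j - 1] + 1, ways[i][j - 1]))
            if i > 0 and j > 0:  # match or mismatch
                cost = 0 if query[i - 1] == target[j - 1] else 1
                candidates.append((dist[i - 1][j - 1] + cost, ways[i - 1][j - 1]))
            best = min(c for c, _ in candidates)
            dist[i][j] = best
            # Every predecessor that reaches the best cost contributes its alignments.
            ways[i][j] = sum(w for c, w in candidates if c == best)
    return dist[n][m], ways[n][m]


print(count_optimal_alignments("AA", "A"))   # (1, 2)
print(count_optimal_alignments("AB", "BA"))  # (2, 3)
```

Even for `"AA"` against `"A"` there are two alignments with the same score and the same start and end: either of the two `A`s in the query can be the unmatched one.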

What edlib does is first find the end locations of a few of the best alignments. Maybe all of them, maybe not; it does not guarantee that it will return all of them. I could implement that, but did not for the sake of speed. I could provide an option that tells edlib to return all of the end locations.

Then, for each end location, edlib finds only one start location, even if there are several. Again, I could find all of them, but I don't, for the sake of speed.
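The two steps described so far (end locations first, then one start per end) can be sketched with a textbook dynamic program. This is only an illustration under stated assumptions: it uses infix alignment (gaps before and after the query in the target are free), it assumes non-empty inputs, the function names are hypothetical, and the choice of which start to report (here the one giving the shortest target span) is arbitrary and not claimed to match edlib's choice.

```python
def _last_row(query, target, free_start):
    """Last DP row; row[j] is the best cost of aligning all of query
    to some stretch of target that ends right before position j."""
    # With free_start, skipping a target prefix costs nothing (infix mode).
    prev = [0] * (len(target) + 1) if free_start else list(range(len(target) + 1))
    for i, q in enumerate(query, 1):
        cur = [i] + [0] * len(target)
        for j, t in enumerate(target, 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (q != t))
        prev = cur
    return prev


def best_locations(query, target):
    """Return (best_distance, [(start, end), ...]), 0-based and inclusive.

    Step 1 finds every end location with the best score.
    Step 2 finds exactly one start location for each of those ends.
    """
    row = _last_row(query, target, free_start=True)
    best = min(row[1:])
    ends = [j - 1 for j in range(1, len(target) + 1) if row[j] == best]

    locations = []
    for end in ends:
        # Align the reversed query against the reversed target prefix,
        # anchored at the end position, and stop at the first hit.
        rev = _last_row(query[::-1], target[:end + 1][::-1], free_start=False)
        consumed = next(j for j in range(1, end + 2) if rev[j] == best)
        locations.append((end - consumed + 1, end))
    return best, locations


print(best_locations("AC", "ACGAC"))  # (0, [(0, 1), (3, 4)])
```

For a query and target like `"AC"` and `"AGC"`, several starts are equally good for the last end location, but the sketch still reports only one of them, which mirrors the behaviour described above. If the edlib Python package is installed, `edlib.align(query, target, mode="HW", task="locations")` should, as far as I know, report locations in the same 0-based inclusive convention, though not necessarily with the same start for each end.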

Finally, I return the CIGAR for only one alignment. I could return it for more of them, but I don't, for the sake of speed. I could add an option to return CIGARs for more alignments, but I would have to figure out how to design this interface: does the user choose which alignment, or do I return all of them?
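Why only one CIGAR comes back can be seen in a simple traceback: whenever several moves are equally good, a tie-breaking rule picks one, and a different rule would give a different but equally valid CIGAR. The sketch below is a global-alignment illustration, not edlib's code. The function `one_cigar` and its `prefer` parameter are hypothetical, and it assumes the extended CIGAR alphabet where `=` is a match, `X` a mismatch, `I` a character present only in the query, and `D` a character present only in the target.

```python
from itertools import groupby


def one_cigar(query, target, prefer="=XID"):
    """Return one extended CIGAR of an optimal global alignment.

    `prefer` lists the four operations in tie-breaking order
    (hypothetical parameter; it must contain '=', 'X', 'I' and 'D').
    """
    n, m = len(query), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                dist[i][j] = i + j
            else:
                dist[i][j] = min(
                    dist[i - 1][j] + 1,
                    dist[i][j - 1] + 1,
                    dist[i - 1][j - 1] + (query[i - 1] != target[j - 1]),
                )

    # Walk back from the bottom-right corner, collecting one optimal path.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        options = {}
        if i > 0 and j > 0:
            same = query[i - 1] == target[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (not same):
                options["=" if same else "X"] = (i - 1, j - 1)
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            options["I"] = (i - 1, j)
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            options["D"] = (i, j - 1)
        op = next(o for o in prefer if o in options)  # tie-breaking happens here
        ops.append(op)
        i, j = options[op]
    ops.reverse()
    # Run-length encode the operations, e.g. ['=', '=', 'I'] -> "2=1I".
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))


print(one_cigar("AA", "A"))                 # 1I1=
print(one_cigar("AA", "A", prefer="IDX="))  # 1=1I
```

Both outputs describe an alignment with edit distance 1 over the same start and end, so returning CIGARs for "all of them" would mean enumerating every such tie, which is where the interface question comes from.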

First of all, I should explain this better in the comments!

Second, I should think about making this interface more flexible and allowing the user to choose how much detail they want. But I need more input from users for this because I am not sure if this is important / needed.

Martinsos commented 6 years ago

In my tests, I find all end locations with a simple algorithm and then check whether Edlib also found all of them. That should fail from time to time, since Edlib does not guarantee to find all the best results, but for some reason it never fails. What does that mean? Did I change how Edlib works so that it now always returns all the end locations? I should investigate this, figure out what is going on and how exactly this works, and document it.

bobbyAtSperry commented 4 years ago

As a user, I think that if you add the above explanation to the documentation, that will be fine for a lot of people. I was confused about all the different start-end pairs, but the above explains it and I am happy with this approach. Good, useful module, BTW.

iprada commented 4 years ago

This came to my mailbox after @bobbyAtSperry's comment.

In case it is helpful for you:

"Edlib does not guarantee to find all the best results, but for some reason it never fails. What does that mean? Did I change how Edlib works and now it returns all the end locations always? I should investigate this, figure out what is going on, how exactly is this working and document it."

I find this feature really useful. I have developed a tool that has to remap very short sequences (range: 5-20 bp) to small regions of the genome (range: 300-10000 bp). The fact that edlib reports all alignments (or at least many of them) when it finds more than one "best result" is very useful for weighting the alignment positions probabilistically and reporting only those that are probable enough. Of course, it would be great to know what is going on behind this behaviour ;)

Best,

Iñigo