Martinsos / opal

SIMD C/C++ library for massive optimal sequence alignment (local/SW, infix, overlap, global)
MIT License
35 stars 10 forks source link

Implement calculation of alignment #1

Closed Martinsos closed 9 years ago

Martinsos commented 10 years ago

Right now we get just score, not alignment, so we should implement it. Place to start: SWIPE can do alignment, but does not have it documented in article. Examine SWIPE code to see how they are doing alignment.

tolot27 commented 10 years ago

The traceback would be really helpful. I've tested the SW Library but it has problems with large sequences (see mengyao/Complete-Striped-Smith-Waterman-Library#9). Your implementation does not have this limitation but lacks the traceback.

I'm already using swipe and wrote some extensions (soft-masking, compositional score/matrix adjustment [not intra-sequence based]). But for the final alignment step, I need a inter-sequence based traceback capable of handling large sequences. Maybe we can work together?

Martinsos commented 10 years ago

Hi, thank you for your interest! I agree that traceback would be useful, only reason it is not implemented is that I did not have time for it yet. As you know, my implementation is based on algorithm used in SWIPE. SWIPE has traceback, what are your experiences with it? What I was planning to do is inspect how they do traceback in SWIPE and than decide on the approach I will use. Sure, help is welcome!

tolot27 commented 10 years ago

For the aligment I just use the align function as implemented in swipe. Transforming the matrix is quite trivial. But giving the aligment end coordinates as hints to this function often returns in an internal error.

This is caused by an incorrect score calculation of swimd. I'll open a new issue for that.

Martinsos commented 10 years ago

@tolot27 Ok, I will certainly have to investigate that. Thank you for your contribution! Sorry for not responding more often, I was on vacation and I will be extremely busy for the next month, but after that I will have much more time to work on swimd! Thank you for your patience, I am looking forward to making swimd better.

Martinsos commented 9 years ago

I am planning to finally implement this one in the next few days.

Approach I will use is to write alignment function that does simple banded alignment. First we will use swimd algorithm to find score and end location(coordinates) of alignment, and then user will be able to call alignment function on that (function will take score and end location). This function will stop when score equal or greater then given score is found. I will have to handle multiple alignment modes here, but I will not have to adjust basic swimd algorithm.

Another approach that I was considering is somewhat more advanced: idea is to use swimd algorithm to find score and end location (same like in previous approach), then use it again, starting from end location, to find the start location of alignment (this would all be done in parallel on multiple sequences), where score would be provided in order to stop calculation when score is found. This is somewhat tricky part, as I have to be sure that start location actually corresponds to end location (which we can ensure by always taking the earliest end location I believe), and I have to use corresponding alignment mode while going back. Then, when we have start location, end location and score, we would run alignment function very similar to the one in previous approach, although it would be somewhat simpler since it will know both start and end, so it will always work in NW mode. While implementing this approach, it will be logical to also make some parts of it general, so I would like basic swimd algorithm to expose API to define maxScore that we need (if bigger score is found calculation stops), and it is also great that user will be able to search for start and end location without doing alignment (previous approach does not offer to find start location without finding alignment). I think this approach could be somewhat faster, as slow part, which is alignment function, will calculate smaller part of matrix (since it know start of alignment), and will also take less memory because of that. It is hard for me to say how fast is swimd algorithm compared to this banded alignment (parallel vs band), so if there would be any improvement and how significant it would be.

Although more promising, second approach is more complicated to implement so I will go with first approach for now, and try the second approach later. It will also be easier to extend to second approach from the first approach then vice versa, which is also an argument to implement the first approach for now.

Martinsos commented 9 years ago

First version is implemented! @tolot27 Next steps are to improve speed of alignment and to reduce memory usage. I will create separate issues for those two. Integrated with 58c6b83c9edf6666de0c9ff059309da26bf05f5b