Martinsos / opal

SIMD C/C++ library for massive optimal sequence alignment (local/SW, infix, overlap, global)
MIT License

report GCUPS for opal_aligner ? #26

Closed jeffdaily closed 9 years ago

jeffdaily commented 9 years ago

I have used SWIPE in the past and am familiar with its output. Specifically, I'm interested in opal_aligner reporting the GCUPS for alignments, either individually or an average for the entire query. Is this possible for the opal_aligner?

Martinsos commented 9 years ago

Hi Jeff! Sure, that should not be a problem to add. I will address it in the next few days and let you know when it is done! Have you tried using Opal so far, and how satisfied are you with it? Are there any other features you would like? If you have any suggestions or ideas, or would just like to talk about it, feel free to write them here or contact me by email (sosic.martin@gmail.com)!

jeffdaily commented 9 years ago

I have only used the opal_aligner application so far. I will probably evaluate opal later this summer in a much larger, distributed-memory application. That's partially why I'm after the GCUPS reporting. It seems to me that's the gold standard these days for making apples-to-apples comparisons against other software applications and libraries (like mine, https://github.com/jeffdaily/parasail).

I'm not sure what the memory requirements are for opal -- could you briefly explain? I ask because I'm interested in performing large analyses where the query is the same as the database (N choose 2, all-against-all alignments), where N would be on the order of tens of millions of sequences.
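For a sense of scale, "N choose 2" counts every unordered pair of sequences once, which is N * (N - 1) / 2 alignments. A minimal sketch of that arithmetic follows; `pairCount` is a hypothetical helper name used only for illustration and belongs to neither opal nor parasail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of any library):
// number of unordered pairs among n sequences, n * (n - 1) / 2.
std::uint64_t pairCount(std::uint64_t n) {
    // Halve whichever factor is even before multiplying,
    // which delays 64-bit overflow for large n.
    return (n % 2 == 0) ? (n / 2) * (n - 1)
                        : n * ((n - 1) / 2);
}
```

With N = 10,000,000 this gives 49,999,995,000,000 pairs, so roughly 5 * 10^13 alignments for an all-against-all run at the low end of "tens of millions".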

Martinsos commented 9 years ago

I added GCUPS in commit 3e676309ca7518d153a3c78ebfe0167c5a855cd4! I used a definition of CUPS that I found in other alignment libraries/tools: query_length * total_database_length / cpu_time. I believe that should be ok?
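The definition above can be sketched as a small function. This is an illustration only, not the code from the commit: the name `gcups`, its signature, and the choice to pass the database as a vector of strings are all assumptions. The division by 1e9 is the usual step from CUPS to giga-CUPS.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not opal's API.
// CUPS  = query_length * total_database_length / cpu_time
// GCUPS = CUPS / 1e9
double gcups(std::uint64_t queryLength,
             const std::vector<std::string>& database,
             double cpuSeconds) {
    assert(cpuSeconds > 0.0);
    // Total database length is the sum of all target sequence lengths.
    std::uint64_t totalDbLength = 0;
    for (const std::string& seq : database) {
        totalDbLength += seq.size();
    }
    // Cell updates: one DP cell per (query residue, database residue) pair.
    const double cells = static_cast<double>(queryLength) *
                         static_cast<double>(totalDbLength);
    return cells / cpuSeconds / 1e9;
}
```

For example, a query of length 1,000 aligned against 2,000 database sequences of length 1,000 each in one CPU second is 2 * 10^9 cell updates per second, i.e. 2 GCUPS.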

Impressive library! I would be interested in hearing how opal behaved in your tests. The memory requirement for opal is query_length * K, where K is a constant of roughly 30 to 40 (depending on the precision used). Basically, only one column of the DP matrix is stored at a time, and that is the main memory requirement. What do you mean by N choose 2, all against all?
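To illustrate why storing one column is enough, here is a minimal scalar sketch of a Smith-Waterman local score that keeps only a single column of length query_length + 1. It is not opal's implementation: opal is SIMD-based, and this sketch assumes a simple match/mismatch score and a linear gap penalty. The function name and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scalar sketch (not opal's SIMD code).
// Local alignment score with linear gap penalty `gap` (>= 0),
// using memory proportional to the query length only.
int localAlignmentScore(const std::string& query, const std::string& target,
                        int match, int mismatch, int gap) {
    assert(gap >= 0);
    // column[i] holds H[i][j] for the target position j being processed.
    std::vector<int> column(query.size() + 1, 0);
    int best = 0;
    for (char t : target) {
        int diag = 0;  // H[i-1][j-1]; H[0][j-1] is always 0
        for (std::size_t i = 1; i <= query.size(); ++i) {
            const int up = column[i - 1];  // H[i-1][j], already updated
            const int left = column[i];    // H[i][j-1], from previous column
            const int sub = (query[i - 1] == t) ? match : mismatch;
            const int score = std::max({0, diag + sub, up - gap, left - gap});
            diag = left;        // becomes H[i-1][j-1] for the next row
            column[i] = score;  // overwrite in place
            best = std::max(best, score);
        }
    }
    return best;
}
```

Because each cell depends only on its left, upper, and upper-left neighbours, the previous column can be overwritten in place, so memory grows with the query length and not with the database length.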