Martinsos opened 7 years ago
Hi @Martinsos,
Thanks for breaking this out into a separate issue. I like your general assessment and proposed (potential) solutions. While I see the overhead for very short sequences, I'll note that most modern sequencing libraries are > 64bp (though there are still some 50bp lingering about, especially with some of this single-cell RNA-seq data). Common read lengths are 75, 100 and 150bp. I guess the issue isn't just sequences shorter than 64bp, but those whose length is far from a multiple of 64bp.
Also, regarding amortizing the cost of bookkeeping, I agree that a batch processing mode would be nice. I also think it would be nice to have some variant of the core functions where one can provide the necessary "working space" for the aligner. This would allow the user to provide the space for different alignment tasks and avoid the memory allocation / deallocation overhead on each call. This is especially important in a case like the one I am imagining, where each thread (of, perhaps, many) is processing reads and it's ideal for each thread to have its own single "scratch" space for its alignment jobs. Additionally, this kind of approach would allow you to implement a sort of "caching" strategy internally, where you could simply have a memory pool of working scratch space and you only allocate new space if all working space is in use (or if none is of adequate size).
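To make the pooling idea above concrete, here is a rough sketch of a thread-safe scratch pool: buffers are handed out from a free list and a new one is allocated only when none is available (all names here are hypothetical, invented for illustration; this is not an existing Edlib API):

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch of a memory pool of working scratch space:
// acquire() prefers a previously released buffer and only allocates
// when the pool is empty; release() returns a buffer for reuse.
class ScratchPool {
    std::mutex mtx_;
    std::vector<std::vector<char>> free_;  // buffers not currently in use
public:
    // Borrow a buffer of at least `size` bytes.
    std::vector<char> acquire(std::size_t size) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            if (!free_.empty()) {
                std::vector<char> buf = std::move(free_.back());
                free_.pop_back();
                if (buf.size() < size) buf.resize(size);  // grow if too small
                return buf;
            }
        }
        // Pool empty: allocate outside the lock.
        return std::vector<char>(size);
    }
    // Return a buffer to the pool so later calls can reuse it.
    void release(std::vector<char> buf) {
        std::lock_guard<std::mutex> lock(mtx_);
        free_.push_back(std::move(buf));
    }
};
```

In steady state each thread tends to get back a buffer of the size it used last time, so allocation happens only for the first few calls or when a larger problem appears.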
The other question / thought I'd add here is: how difficult would it be to expose a variant of the alignment functions that doesn't require the pre-conversion step prior to alignment? Since, in the read alignment problem, edit distances are typically small (i.e., it's common for reads to align to the reference with an edit distance of 2 or 3), it seems that the relative overhead of a linear walk over the read to convert the reference and query sequence into an alternative format could become higher than it is for long protein sequences that are potentially much more divergent.
No problem, I should thank you for all the information, this helps a lot :)!
Important question: is it usually many short queries on single target, or multiple targets?
Ok, great to know that common read lengths are 75, 100 and 150, that gives me a good idea on what to focus.
Right, if their length is far from a multiple of 64bp, they are losing some computational power. On longer sequences that has no impact, but on shorter ones it certainly does.
Regarding the idea of providing "working space" for the aligner - the motivation is that it would be more flexible than a batch mode, correct? Sounds interesting, although it also sounds pretty complicated - I would need some sort of my own memory management logic to use that space correctly.
Ok, I get what you mean by avoiding transforming the sequences. Transforming them simplifies some other steps in the algorithm. I could probably avoid the transformation, but I am not sure at this moment if it would actually improve performance or just "move the work" to another part of the algorithm. I will certainly give it a thought, but first of all I will run profiling to see if this actually has any impact. The thing is, the algorithm itself is quadratic, so the time consumed by linear parts is in general negligible, but again, on such short sequences things may behave differently than I expect, so profiling is the way to go.
Regarding the input format of sequences, is `char*` ok with you?
By the way, are you aware of the parameter `k`? If you know that the expected edit distance is most likely smaller than some value `k`, then it is a good idea to provide it as the parameter `k` to Edlib, which will run much faster in that case (well, only for longer sequences again :)).
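To illustrate why a known bound `k` helps, here is a banded dynamic-programming sketch in the style of Ukkonen's band: only DP cells within `k` of the diagonal are computed, so the work drops from O(n·m) to O(k·n). This is just the idea behind the parameter, not Edlib's actual implementation (which applies the same trick inside the bit-vector algorithm):

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Band-limited edit distance: only cells with |i - j| <= k are computed.
// Returns the edit distance if it is <= k, otherwise -1 ("no alignment
// within budget"), mirroring how a k-bounded aligner can bail out early.
int bandedEditDistance(const std::string& a, const std::string& b, int k) {
    int n = (int)a.size(), m = (int)b.size();
    if (std::abs(n - m) > k) return -1;  // length difference alone exceeds k
    const int INF = k + 1;               // any value > k means "out of budget"
    std::vector<int> prev(m + 1, INF), curr(m + 1, INF);
    for (int j = 0; j <= std::min(m, k); ++j) prev[j] = j;  // row 0, inside band
    for (int i = 1; i <= n; ++i) {
        int lo = std::max(1, i - k), hi = std::min(m, i + k);
        std::fill(curr.begin(), curr.end(), INF);
        if (i - k <= 0) curr[0] = i;     // first column only while in band
        for (int j = lo; j <= hi; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j - 1] + cost,  // match / substitution
                                prev[j] + 1,         // deletion
                                curr[j - 1] + 1});   // insertion
            curr[j] = std::min(curr[j], INF);        // clamp, values > k are all equal
        }
        std::swap(prev, curr);
    }
    return prev[m] <= k ? prev[m] : -1;
}
```

With k much smaller than the sequence lengths, the band covers only a thin diagonal strip of the DP matrix, which is where the speedup for long sequences comes from.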
Ok, so my thoughts so far are:
1. Definitely try to find another algorithm that will perform faster for smaller sequences, and use it alongside the existing one. Especially focus on lengths from 75 to 150.
2. Do profiling on allocation / deallocation of memory and see if it has an impact on speed; if yes, consider how to minimize the number of allocations/deallocations. Possibly use shared memory, shared between independent align() calls, so the user has flexibility.
3. Do profiling on time spent transforming sequences, and if it has an impact on speed, consider whether this can be avoided.
Since I am sure that both 2. and 3. are problems only for shorter sequences, and as said in 1. I should find another algorithm for shorter sequences, it makes sense to first do 1., as the new algorithm may by design avoid 3., and maybe even 2. to some extent.
I did some profiling on current master (c1f04e8e11b232c0fc3baa462e0a579fd3bdad4d), to see how short sequences behave compared to longer sequences.
I chose 2 different scenarios, one is aligning a read to target (read being much smaller than target) using infix(HW) method, and another is aligning query to target (query and target being about the same size), using global(NW) method. For both of these scenarios I ran 2 tests: one with short sequences and another with long sequences, so I can compare how Edlib behaves for shorter ones compared to longer ones. For short ones I ran it multiple times in order to ensure that computation time is long enough for precise measurements.
I used `valgrind --tool=callgrind` for this. I attached a .zip with callgrind outputs for all 4 tests.
I used the following format for naming them: `callgrind-<query_length>-<mode>-<target_length>-ed_<edit_distance>-g.out` (g means that the debug flag was used when compiling).
I used `kcachegrind` to observe the callgrind results.
For long sequences, time is split between housekeeping tasks in the corresponding `myers...` function (~32% of time) and the `calculateBlock` function (~58% of time), which is the core of Myers's algorithm and where the actual calculation happens. The other 10% is spent on tiny bits all around. Memory allocation/deallocation takes no time at all, and sequence transformation takes < 1%. We could say that this is a healthy distribution of time usage, and it is the same for both the HW and NW case.
For short sequences, in the NW case (two seqs of 80bp length, ~90% similarity), the `calculateBlock` function consumes only 13% of time, meaning very little time is spent on actual calculation. Not to mention that the calculation in `calculateBlock` works at 80/128 = 62% efficiency, because it works with 64bp blocks, so only 0.62 * 13% ≈ 8% of time is actually spent on pure calculation, compared to 58% for longer sequences, which is about 7 times worse performance.
Housekeeping jobs in the `myers..` function take about 24% + 6% = 30% of time, the same as for long seqs.
So who stole the time from `calculateBlock`? It turns out that `buildPeq` took 34% of time, while it normally takes < 1% for longer seqs! This is the part where a special structure is built to speed up the calculation later, but because the target is short, that structure is not used often enough to make its construction negligible.
Also, 10% goes to sequence transformation.
Again, negligible amount of time is spent on memory allocation/deallocation.
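To make concrete what this kind of preprocessing does (based on Myers's published bit-vector algorithm; Edlib's exact internals may differ, and `buildPeqSketch`/`symbolOf` are names made up here), a sketch for a single 64bp block: for every alphabet symbol we build a bitmask marking which query positions hold that symbol, so the core loop can later look up one mask per target character instead of comparing characters individually.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of Peq construction for a query of length <= 64.
// peq[s] has bit i set iff query[i] maps to alphabet symbol s.
// symbolOf maps each char to an index in [0, alphabetSize).
std::vector<uint64_t> buildPeqSketch(const std::string& query,
                                     int alphabetSize,
                                     const std::vector<int>& symbolOf) {
    std::vector<uint64_t> peq(alphabetSize, 0);
    for (std::size_t i = 0; i < query.size(); ++i)
        peq[symbolOf[(unsigned char)query[i]]] |= uint64_t(1) << i;
    return peq;
}
```

The construction is linear in the query length plus the alphabet size, which is trivial next to a long alignment, but for an 80bp-vs-80bp problem it is done once and used only ~80 times, so its relative cost balloons, exactly as the profile shows.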
For short sequences, in the HW case (query is 80bp, target is 100kbp), the situation is not so bad because the target is long (I should probably also try this with a shorter target). 45% goes on housekeeping in `myers..`, and 39% goes on `calculateBlock()`, which is not so bad. Because of the lower efficiency of `calculateBlock()` we actually have 0.62 * 39% ≈ 24% of pure calculation time, compared to the healthy 58%, which is > 2 times worse performance.
Again, 10% goes on transforming sequences, and negligible amount of time is spent on memory allocation/deallocation.
From this, I would say that focusing on memory management and sequence transformation is not important. Solving 3. and 4. is the key, and the only solution I can think of right now is using another algorithm for shorter sequences (which is hopefully faster than Myers's algorithm in this case), which will most likely also avoid sequence transformation. Somehow modifying Myers's algorithm to work better for shorter sequences would be super cool, but I doubt this can be done - I should look into it.
NOTE: I am actually not sure if `valgrind --tool=callgrind` is good for estimating how much time is spent on memory allocation! It counts the number of instructions executed, which may not be a good measure, since allocating memory is a slow operation. I have to investigate further how to do profiling that will correctly measure the amount of time spent on memory allocation/deallocation.
Some interesting profiling tools that might be better for memory allocation time consumption: gprof, gperftools, I have to try them out.
NOTE: I did profiling with gprof, and it still seems that memory allocation has no significance (when on one thread), so I believe the summary is ok.
@rob-p could you please give me some more insight into the sequence sizes that you usually align or expect to align? You mentioned that queries are often from 75bp to 150bp, but what about the targets? Is it usually a 100bp query to a 100bp target, or a 100bp query to a 10kbp target, or both? What are some usual numbers?
Hi @Martinsos --- thanks for all of the great & quick feedback!
The usage scenario I envision is the following. I have many threads running independently and each one repeatedly wants to perform the following task. I have a single read (75 --- 150bp long) and I know approximately where I want to perform the alignment. That is, I have a putative position on the reference where the alignment should begin (though, perhaps, I want to include a few nucleotides of "wiggle room" to allow the read to align slightly before / after this position).
That is, I have a scenario like this: I want to align an N basepair read to an N+10 (say) length substring of the reference. I know the approximate position where the read should start. As you mention above, I have an edit distance budget as well (i.e. a `k`), so that if the edit distance exceeds `k`, I'm no longer interested in retaining the alignment.
Now, each thread will repeat this process of taking a read, using our quasi-mapping algorithm to determine where the read comes from, and then attempting to actually align the read at this position. The reason I've made so much noise about working space is that `new` and `delete` are well-behaved in single-threaded land, but can become a significant source of resource contention since threads will be asking for resources repeatedly. The other interesting thing about this use case is that since all of the reads are approximately the same length, it's not as if there will be many problems of drastically different size. So, the memory allocated for one problem can probably be re-used (after it is cleared) for >= 90% of the alignment problems a thread will encounter. In this case, I agree that it's probably too much trouble for you to implement a caching system and all of that complexity. However, one easy solution would be to provide a version of your main function that simply accepts an extra parameter (or extra parameters) that is the working space for the alignment. That is, the onus falls on the caller to have previously allocated the memory (perhaps by calling an `edlib` utility function) and to free the memory when it's no longer needed. This way, a thread can allocate the working space for its first alignment, and then simply re-use this space for the rest of the alignments it performs.
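As a rough illustration of what such a workspace-taking variant could look like (the names `AlignBuffer` and `alignWithBuffer` are invented for this sketch, and a plain rolling-row DP stands in for Edlib's actual algorithm): the caller owns a buffer that grows but never shrinks, so a warm buffer makes repeated calls allocation-free.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Caller-owned working space: grows on demand, never shrinks,
// so repeated calls of similar size reuse the same allocation.
struct AlignBuffer {
    std::vector<int> dp;  // scratch for two rolling DP rows
    void reserveFor(std::size_t targetLen) {
        std::size_t needed = 2 * (targetLen + 1);
        if (dp.size() < needed) dp.resize(needed);  // allocate only when growing
    }
};

// Edit distance with a rolling-row DP; once the buffer is warm,
// this function performs no heap allocation at all.
int alignWithBuffer(const std::string& q, const std::string& t, AlignBuffer& buf) {
    buf.reserveFor(t.size());
    int* prev = buf.dp.data();
    int* curr = prev + (t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = (int)j;
    for (std::size_t i = 1; i <= q.size(); ++i) {
        curr[0] = (int)i;
        for (std::size_t j = 1; j <= t.size(); ++j) {
            int cost = (q[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);
    }
    return prev[t.size()];
}
```

Each thread would keep one `AlignBuffer` for its lifetime, which is exactly the "allocate once for the first alignment, reuse thereafter" pattern described above.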
Thanks, that gives me a much better idea of what you need this for!
For what you are doing, I would recommend using HW (infix) mode with `k` set.
What is exactly the result you are looking for? Only edit distance, or is it start/end locations of alignment, or is it alignment path?
I completely understand the idea behind the shared working space, and it does make a lot of sense! I see how it could bring benefit in this case.
I am somewhat careful about introducing this feature because I feel it will bring more than a little complexity. I have many functions, some of them being used in different scenarios and in different places, regardless of the method used. I would have to propagate the "working memory" through all of them, and make sure that if "working memory" is not given, everything still works. I could probably just allocate it myself if not given, or something similar, as a way around this.
However, "working memory" means that in one central place I have the "knowledge" from all the functions - internal knowledge of how they allocate and use memory internally. I like having things encapsulated as much as possible, while this would have exactly the opposite effect.
Also, this would require careful planning of how much space is needed upfront, which is not always easy to determine. If there is not enough space at some moment, I would have to detect and manage that.
All this would require changes throughout the whole codebase, coupling the internal logic of certain functions with "working memory" and its management, while on the other hand also complicating the final API somewhat. Also, I expect all this to make further maintenance and upgrading of Edlib harder.
All this does not mean that this is not worth implementing - however, it makes me think hard before deciding to do it, and I would rather be sure that there is no other, less intrusive way to do this, and that benefits brought by this would justify the added complexity.
I believe that the biggest performance improvement can be made by implementing an alternative algorithm for shorter sequences. Having this in mind, and also having in mind that memory allocation has an impact only on short sequences, it is obvious that it would not make much sense to optimize the current algorithm for it.
Therefore, I suggest the following: let's wait until the algorithm for shorter sequences is implemented and in place. Then, we can revisit the problem of allocating memory and idea of "working memory" and see if it is still needed and how hard it will be to implement. It may be that with the new algorithm this will not be a problem at all anymore (maybe all memory needed for short sequences can fit on stack) or maybe implementing this change will be much simpler. Does that make sense?
I created a new issue #24 for the idea of shared working memory.
Edlib is currently not as efficient for shorter sequences (especially queries) as it is for longer ones. Shorter meaning < 100bp, although I would have to do more tests to confirm the exact threshold.
Since there are many cases where users want to align short sequences (often from 75 to 150 in length), we should find a way to improve this!
The biggest reason for Edlib being slower for shorter sequences is certainly the nature of Myers's fast bit-vector algorithm, which is the central algorithm of Edlib. Myers's algorithm processes the query in chunks of 64 elements, in parallel as if they were one element, achieving an ideal speedup of 64x, reduced in practice by some factor of housekeeping. Of course, this means that for queries smaller than 64 a big part of the speedup is lost, and below some threshold the housekeeping jobs actually have the opposite effect, making the algorithm slower than some simpler algorithms. Additionally, there are other housekeeping jobs (allocation/deallocation of memory, input preparation, preparation of internal structures) which are needed for Myers's algorithm but may be eating a large chunk of computation time for very short sequences. Finally, there is the transformation of sequences that is currently done, which may also be consuming a significant amount of time for short sequences.
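To make the 64-element chunking concrete, here is a minimal single-word sketch of Myers's bit-vector recurrence for global (NW) edit distance, valid for queries up to 64 characters. This follows Myers's published algorithm; Edlib's real implementation chains multiple 64-bit blocks for longer queries and adds the banding, modes, and path reconstruction on top:

```cpp
#include <cstdint>
#include <string>

// Single-word Myers bit-vector edit distance (query length must be 1..64).
// One column of the DP matrix is encoded in two bit-vectors (Pv: positions
// where the value increases going down, Mv: where it decreases), and each
// target character updates all 64 cells with a handful of word operations.
int myersEditDistance(const std::string& query, const std::string& target) {
    int m = (int)query.size();
    uint64_t peq[256] = {0};  // Peq: per-symbol bitmask of query positions
    for (int i = 0; i < m; ++i)
        peq[(unsigned char)query[i]] |= uint64_t(1) << i;
    uint64_t Pv = ~uint64_t(0), Mv = 0;
    uint64_t highBit = uint64_t(1) << (m - 1);
    int score = m;  // distance at the bottom of the current column
    for (char c : target) {
        uint64_t Eq = peq[(unsigned char)c];
        uint64_t Xv = Eq | Mv;
        uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
        uint64_t Ph = Mv | ~(Xh | Pv);
        uint64_t Mh = Pv & Xh;
        if (Ph & highBit) ++score;  // bottom cell increased
        if (Mh & highBit) --score;  // bottom cell decreased
        Ph = (Ph << 1) | 1;  // carry in 1: global alignment (NW boundary)
        Mh <<= 1;
        Pv = Mh | ~(Xv | Ph);
        Mv = Ph & Xv;
    }
    return score;
}
```

Everything outside the inner loop (building `peq`, setting up the vectors) is the "housekeeping" discussed above: constant work per call that is amortized over long targets but dominates when both sequences are short.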
Ideas:
Add another algorithm to Edlib which is specialized for shorter sequences. I have yet to explore which algorithm is the best fit for this. It should also be banded, work for all alignment methods, and be able to return the alignment path. Then, have Edlib internally choose which algorithm to use based on sequence sizes. I feel like even just going with a simpler algorithm may be enough here, because it will have less housekeeping.
Do profiling on housekeeping jobs when aligning short sequences. When aligning short sequences, users often want to align many of them at once (to same target or different targets).
If the impact of housekeeping jobs is high, consider how to do batch processing of those short sequences (meaning users would provide Edlib with all of them at once instead of calling align() on each pair) in such a way that housekeeping jobs are shared (reusing allocated memory and structures?), therefore reducing total time spent on housekeeping.
Another idea is not to group queries in batches, but to have a way to externalize the needed memory so it is allocated only once and then shared among independent align() calls - in this case the user would be aware of this (since we can't use a C++ class that would hide that memory).
Currently I transform sequences. Do profiling; if that consumes a significant amount of time, then I should look for a way to avoid it.
Ideally, 2. and 3. should be explored after implementing 1., since they will be specific to that algorithm; they are not a problem for long sequences.