Improve dummy shift merging

Currently we generate only the needed dummy shifts (no duplicates) using the longest common prefix (LCP) lengths of the lexically sorted dummy nodes (which are obtained by reversing the outgoing dummy nodes). However, these shifts still need sorting and merging, and the resulting table is much larger than one storing only the dummy nodes without their shifts (usually 77% of the size of the BOSS matrix, vs 2% for the dummy nodes alone).
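For reference, here is a rough Python sketch of the LCP trick as I understand it, not the actual cosmo code. The names `lcp_len` and `dummy_shifts` are made up for illustration, and I am assuming a shift is a proper prefix of a dummy node padded on the left with `$` to length k. A prefix is new only if it is longer than the LCP with the previous node in sorted order.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def dummy_shifts(sorted_nodes, k):
    """Yield each distinct $-padded proper prefix of the sorted dummy nodes once.

    Assumption: sorted_nodes is lexically sorted and every node has length k.
    Prefixes no longer than the LCP with the previous node were already
    emitted for an earlier node, so they are skipped.
    """
    prev = ""
    for node in sorted_nodes:
        shared = lcp_len(prev, node)
        for length in range(shared + 1, k):
            yield "$" * (k - length) + node[:length]
        prev = node


if __name__ == "__main__":
    for shift in dummy_shifts(["ACG", "ACT", "GTA"], 3):
        print(shift)
```

The output of this is not in BOSS order, which is why the shifts still have to be sorted and merged afterwards.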

I think that if we count the occurrences of each symbol at each position in a sigma * k grid (taking the LCP values into account, i.e. ignoring symbols that fall inside the shared prefixes), we could store only the dummy nodes themselves (the 2%-sized table) and use the count table to "rank" each dummy node, that is, use something like the radix sort counting loop to work out where in our dummy table each shift would be. Then, with a B-sized buffer and a D-sized (non-shifted) dummy table, we could generate the shifts in D/B passes while merging.
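A minimal sketch of just the counting step, to make the idea concrete. Everything here is hypothetical: the function name `shift_symbol_counts`, the DNA alphabet, and the choice to count only proper prefixes (lengths 1 to k-1) are my assumptions, and the ranking and merging passes are not shown.

```python
def shift_symbol_counts(sorted_nodes, k, alphabet="ACGT"):
    """Build a (k-1) x sigma grid of symbol counts over the distinct shifts.

    counts[pos][sym] is the number of distinct prefixes of length pos + 1
    that end in sym. Symbols inside the prefix shared with the previous
    (lexically sorted) node are ignored, so duplicate shifts are not counted.
    """
    counts = [dict.fromkeys(alphabet, 0) for _ in range(k - 1)]
    prev = ""
    for node in sorted_nodes:
        # Length of the prefix shared with the previous node.
        shared = 0
        for x, y in zip(prev, node):
            if x != y:
                break
            shared += 1
        # Only positions past the shared prefix start a new shift.
        for pos in range(shared, k - 1):
            counts[pos][node[pos]] += 1
        prev = node
    return counts


if __name__ == "__main__":
    for pos, row in enumerate(shift_symbol_counts(["ACG", "ACT", "GTA"], 3)):
        print(pos, row)
```

Each row total is the number of distinct shifts of that length, which is the kind of bucket size a radix-style counting pass would start from.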

It is doubtful this would be much faster (especially since STXXL uses parallel sorting), but it would require much less disk space.

This only works because we add reverse complements beforehand, and it would need updating if support for something like BCALM (which removes the need for reverse complements) were added.

cosmo-team / cosmo-issues

Improve dummy shift merging #187