cosmo-team / cosmo-issues

Issue repository for Cosmo (separate until we can transfer issues between repositories nicely - ignore code)
http://www.github.com/cosmo-team/cosmo
GNU General Public License v3.0
0 stars 0 forks source link

Improve dummy shift merging #187

Open alexbowe opened 8 years ago

alexbowe commented 8 years ago

Currently we generate only the needed dummy shifts (no duplicates) using the longest common prefix (LCP) lengths of the lexically sorted dummy nodes (which are obtained by reversing the outgoing dummy nodes). However, these still need sorting and merging, and the table is much larger than storing *only the dummy nodes (without their shifts) (usually 2% of the size of the BOSS matrix vs 77%).

I think that if we count the occurrences of each symbol in each position (taking the LCP values into account - ignore symbols that are inside the shared prefixes) in a sigma * k grid, we can store only the dummy nodes needed (the 2% sized table), and use the count table to "rank" each dummy node (that is, use something like the radix sort loop to count where in our dummy table each shift would be). Then, with a B-sized buffer and a D sized (non-shifted) dummy table, we could generate the shifts in D/B passes while merging.

It is doubtful this would be much faster (especially since STXXL uses parallel sorting), but it would require much less disk space.