Support indirect sorting (so colour payload is an index rather than bitvector)

mmuggli commented 8 years ago

We may want to run salmonela on a machine with decent memory but no SSDs. If it's like the ecoli4k dataset, color data may take up most of the space. (salmonela is about the size of E. coli). I'm imagining streaming the color data to disk during merging while storing an index into that data with each k-mer. Construction could then hopefully fit in memory. We would memory map the color file in order to build the RRR which should only do sequential access to the color data. Upon traversal, we'd first look up the index associated with a k-mer and then use that index to get the actual color vector from the RRR.

simon-j-puglisi commented 8 years ago

Interesting idea. There is probably some literature on external memory (or semi-external memory, as you're proposing) graph traversal. Maybe have a look for papers by Ulrich Meyer, Peter Sanders (probably from about 5-10 years ago, or more) --- that should lead you in the right direction.

On Sat, Apr 23, 2016 at 12:55 AM, Martin Muggli notifications@github.com wrote:

We may want to run salmonela on a machine with decent memory but no SSDs. If it's like the ecoli4k dataset, color data may take up most of the space. (salmonela is about the size of E. coli). I'm imagining streaming the color data to disk during merging while storing an index into that data with each k-mer. Construction could then hopefully fit in memory. We would memory map the color file in order to build the RRR which should only do sequential access to the color data. Upon traversal, we'd first look up the index associated with a k-mer and then use that index to get the actual color vector from the RRR.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/cosmo-team/cosmo-issues/issues/178

mmuggli commented 8 years ago

The main point here is there is no sense in moving around a 28,000 bit vector per kmer while we are sorting kmers when a 32 bit index would suffice. The kmers need to be sorted in colex order but the color vectors can be kept in the order we get them.

I think external is still the way to go, but on the machine that stores the strains, this may make sense.

Martin Muggli +1-(303)-947-6680 http://martinmuggli.com On Apr 22, 2016 10:57 PM, "simon-j-puglisi" notifications@github.com wrote:

Interesting idea. There is probably some literature on external memory (or semi-external memory, as you're proposing) graph traversal. Maybe have a look for papers by Ulrich Meyer, Peter Sanders (probably from about 5-10 years ago, or more) --- that should lead you in the right direction.

On Sat, Apr 23, 2016 at 12:55 AM, Martin Muggli notifications@github.com wrote:

We may want to run salmonela on a machine with decent memory but no SSDs. If it's like the ecoli4k dataset, color data may take up most of the space. (salmonela is about the size of E. coli). I'm imagining streaming the color data to disk during merging while storing an index into that data with each k-mer. Construction could then hopefully fit in memory. We would memory map the color file in order to build the RRR which should only do sequential access to the color data. Upon traversal, we'd first look up the index associated with a k-mer and then use that index to get the actual color vector from the RRR.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/cosmo-team/cosmo-issues/issues/178

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/cosmo-team/cosmo-issues/issues/178#issuecomment-213664767

alexbowe commented 8 years ago

That should be pretty easy. If you are using the develop branch, we can (conditionally, if NUM_COLS > 64) change the dbg_builder payload type to a size_t, and change the code inside the lambda (my merge uses a lambda instead of passing in buffers) to stream the colour vectors to disk, then push the kmer-index pair to the dbg_builder.

changing the payload type would be done here: https://github.com/cosmo-team/cosmo/blob/develop/cosmo-build.cpp#L155

and this lambda would be modified to write c to disk, and push a post-incremented index instead: https://github.com/cosmo-team/cosmo/blob/develop/cosmo-build.cpp#L184

Then, the lambda passed to the builder's build() method would need modification to do whatever it is we wanted to do with these indices (instead of building the colour matrix): https://github.com/cosmo-team/cosmo/blob/develop/cosmo-build.cpp#L203

Can sd_vectors be built sequentially too? Because the colour matrix seems pretty dense that I think we should be using that (but with inverted bits) instead.

alexbowe commented 8 years ago

Note that this would also overcome STXXL's limitation of only allowing fix-width types too, so we could support dynamic amounts of colours.

cosmo-team / cosmo-issues

Support indirect sorting (so colour payload is an index rather than bitvector) #178