cosmo-team / cosmo-issues

Issue repository for Cosmo (separate until we can transfer issues between repositories nicely - ignore code)
http://www.github.com/cosmo-team/cosmo
GNU General Public License v3.0
0 stars 0 forks source link

Find cause of uninit'd memory accesses in new edge index #127

Open alexbowe opened 9 years ago

alexbowe commented 9 years ago

This will allow us to get a better estimate of how long the traversal will take.

alexbowe commented 8 years ago

Implemented (with unit test) on feature/dna-bv branch. Can't use sdsl::hyb_vector yet because it doesn't support select, so uses sd_vector by default, and still need to make sure the API is okay for integration into DBG (I made a few assumptions about the alphabet which might require some changes to DBG code).

In the meantime, here are the benchmark times for 10^5 random queries over a 10^5-length random string of $acgtACGT (WT is Wavelet Tree, RS is the bit-vector-based rank-select index):

RS Construction average time per element: 27.7642 ns WT Construction average time per element: 58.1193 ns RS Access average time per element: 293.915 ns WT Access average time per element: 1747.63 ns RS Rank average time per element: 63.4227 ns WT Rank average time per element: 932.446 ns RS Select average time per element: 70.3594 ns WT Select average time per element: 1347.1 ns

Integrating should be as simple as removing the int_to_nt and nt_to_int calls (which are for int to ascii conversion, which aren't used internally by the dBG code). I'd like to coordinate this with @mmuggli though (to make sure I'm not breaking anything - it might be a good time to work on #141).

alexbowe commented 8 years ago

Need to finish serialization and change the way alphabets are handled. Going to work on merging with @mmuggli on Wednesday (Colorado time).

alexbowe commented 8 years ago

Finished serialisation support in commit https://github.com/alexbowe/cosmo-private/commit/5d535134bbf49ad5fcc0574eae0c9bd35c9bf8ce

Additional benchmark results (for space usage):

RS average bits per element : 6.26816 WT average bits per element : 3.78032 (the benchmark data is uniformly random and the underlying binary rank/select indices are sparse bitvectors, so may be different in practice)

@simon-j-puglisi how does this compare to the other indices you had in mind?

Remaining: support different input symbols (e.g. integers or another ascii encoding if desired)

alexbowe commented 8 years ago

Finished changing internal rep to ints (commit: https://github.com/alexbowe/cosmo-private/commit/22a288fe4cedc1e802ea48f7806bf10a512f0a16). Would be nice to support ascii (or whatever alphabet) like SDSL's wavelet trees do at some point.

Not closed yet as still need to test it with the debruijn graph. Should be possible to pass dna_bv_rs in as a template parameter to the debruijn_graph class now. Merging branches (#141) before doing this is a good idea (so we can test it with the bubble calling code).

alexbowe commented 8 years ago

Still needs sdsl::construct()-compatible (or construct_im) API to be a drop-in replacement for WT.

Or, we could add our own construct() method templated on the type that redirects to SDSL's, but is specialised for our own structures.