beling / bsuccinct-rs

Rust libraries and programs focused on succinct data structures
Apache License 2.0

Building `fmph::keyset::DynamicKeySet` from a `rayon::iter::ParallelIterator` instead of `Iterator` #3

Closed by progval 1 year ago

progval commented 1 year ago

Hi,

I'm trying to build a DynamicKeySet (actually, a CachedKeySet) from a rayon::iter::ParallelIterator instead of an Iterator. This seems like a natural fit, given that my own input comes as parallel iterators, and fmph uses them internally as well.

I attempted this by copying the DynamicKeySet structure and crudely replacing every `: Iterator` bound with `: ParallelIterator`. This works for the most part; however, I am stuck implementing for_each_key and into_vec, as their predicates lack the Send and Sync bounds needed to call rayon's .map().

And adding these bounds to the KeySet definition seems excessive, given that not all users need them.

Any tips?

Thanks!

beling commented 1 year ago

Hi,

The F function passed to for_each_key is more efficient in the single-threaded version, and that version is always used when rayon has only one thread (which can only be checked at runtime). The solution is a KeySet with two functions: one that creates an Iterator and one that creates a rayon::iter::ParallelIterator. I will try to implement something like this soon.

beling commented 1 year ago

Please check if the code I have just added to the repository works and is useful to you.

See: CachedKeySet::dynamic_par(keys: impl GetParallelIterator, clone_threshold: usize) and DynamicParKeySet. Note that GetParallelIterator is implemented by a pair of functions: (F: Fn() -> impl Iterator, PF: Fn() -> impl ParallelIterator)
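The "pair of factories" design described above can be sketched without the `ph` or rayon crates. The names below (`IterPair`, `collect`) are illustrative, not part of the library's API, and the "parallel" side is a std-only stand-in for a `Fn() -> impl ParallelIterator`; the point is the runtime choice between the sequential and parallel path, mirroring the one-thread check mentioned earlier in the thread.

```rust
/// Std-only sketch of pairing a sequential-iterator factory with a
/// "parallel" one (hypothetical names; in `ph` the parallel factory
/// would return a rayon `ParallelIterator`).
struct IterPair<F, PF> {
    seq: F,  // Fn() -> impl Iterator<Item = u64>
    par: PF, // stand-in for Fn() -> impl ParallelIterator
}

impl<I, F, PF> IterPair<F, PF>
where
    I: Iterator<Item = u64>,
    F: Fn() -> I,
    PF: Fn() -> Vec<u64>, // sketch: the "parallel" pass just yields all items
{
    /// Pick the sequential path when only one thread is available,
    /// mirroring the runtime check described in the issue.
    fn collect(&self, threads: usize) -> Vec<u64> {
        if threads <= 1 {
            (self.seq)().collect()
        } else {
            (self.par)()
        }
    }
}

fn main() {
    let keys = vec![1u64, 2, 3, 4];
    let pair = IterPair {
        seq: || keys.iter().copied(),
        par: || keys.iter().copied().collect(),
    };
    assert_eq!(pair.collect(1), vec![1, 2, 3, 4]);
    assert_eq!(pair.collect(8), vec![1, 2, 3, 4]);
    println!("ok");
}
```

The design choice here is that the caller supplies both factories once, and the consumer decides per call which one to run, so single-threaded users never pay for the Send + Sync bounds the parallel path requires.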

progval commented 1 year ago

Works fine on my test dataset (23 keys), but on a larger one (602041 keys) it seems to call par_iter endlessly unless it returns None, despite my key order being consistent. I'll put together some minimal code to reproduce the issue as soon as I can.

beling commented 1 year ago

Thanks. I've just fixed a bug and added a test, which suggests that everything should work now.

progval commented 1 year ago

Amazing, thanks! It works and I'm getting a 100× speedup on 100 threads.

I need to fix my code, because a few duplicates in my "real life" dataset make the computation crash (7 duplicates out of 27 billion keys :( ), but it's looking very promising.

beling commented 1 year ago

Good to hear! I will publish ph 0.8 to crates.io soon. I've merged DynamicParKeySet into DynamicKeySet and renamed a few items (e.g. dynamic_par -> dynamic); see: https://github.com/beling/bsuccinct-rs/commit/01f3b12d118faa0f57843977dcbb34386f92c264

beling commented 1 year ago

Note that you can use the try_with_conf_stats_or_partial constructor of (GO)Function to find duplicates in expected linear time if you do not have the memory for other methods (like using a simple HashSet).
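For comparison, the simple HashSet method mentioned above can be sketched in a few lines of std Rust. The helper name `find_duplicates` is illustrative, not part of the library; the sketch shows why this approach is expected linear time but memory-hungry: it keeps every distinct key in memory, which is exactly what becomes infeasible at 27 billion keys.

```rust
use std::collections::HashSet;

/// Return keys that occur more than once (reported at their second
/// occurrence). Expected O(n) time, but it stores every distinct key,
/// so it does not scale to datasets like the 27-billion-key one above.
fn find_duplicates(keys: impl IntoIterator<Item = u64>) -> Vec<u64> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for k in keys {
        // `insert` returns false when the key was already present.
        if !seen.insert(k) {
            dups.push(k);
        }
    }
    dups
}

fn main() {
    let keys = [1u64, 2, 3, 2, 4, 1];
    assert_eq!(find_duplicates(keys), vec![2, 1]);
    println!("ok");
}
```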