Closed progval closed 1 year ago
Hi,
The F
function in for_each_key
is more efficient in the single-threaded version, and such a version is always used when rayon has only 1 thread (which can only be checked at runtime).
The solution is a KeySet
which will have two functions, the first creating an Iterator
and the second a rayon::iter::ParallelIterator
.
I will try to implement something like this soon.
Please check if the code I have just added to the repository works and is useful to you.
See: CachedKeySet::dynamic_par(keys: impl GetParallelIterator, clone_threshold: usize)
and DynamicParKeySet
.
Note that GetParallelIterator is implemented by a pair of functions:
(F: Fn() -> impl Iterator, PF: Fn() -> impl ParallelIterator)
works fine on my test dataset (23 keys), but for a larger one (602041 keys) it seems to call par_iter
an infinite number of times if it doesn't return None
; despite my key order being consistent. I'll give you some minimal code to reproduce the issue as soon as I can.
Thanks. I've just fixed a bug and added the test which suggests that everything should work now.
Amazing, thanks! It works and I'm getting a 100× speedup on 100 threads.
I need to fix my code because a few duplicates in my "real life" dataset make the computation crash (7 duplicates over 27 billion keys :( ), but it's looking very promising
Good to hear! I will send ph=0.8
to cargo soon.
I've merged DynamicParKeySet
with DynamicKeySet
and changed some names (e.g. dynamic_par
-> dynamic
), see:
https://github.com/beling/bsuccinct-rs/commit/01f3b12d118faa0f57843977dcbb34386f92c264
Note that you can use try_with_conf_stats_or_partial
(GO)Function
constructor to find duplicates in expected linear time if you have no memory for other methods (like using simple HashSet
).
Hi,
I'm trying to build a
DynamicKeySet
(actually, aCachedKeySet
) from arayon::iter::ParallelIterator
instead of anIterator
. I feel this makes sense given I have a bunch of iterators as my own input, andfmph
internally uses them as well.I attempted to do this by copying the
DynamicKeySet
structure, and crudely replacing every: Iterator
bound with a: ParallelIterator
. This works for the most part; however I am stuck implementingfor_each_key
andinto_vec
as their predicate lack theSend
andSync
bounds necessary to call rayon's.map()
.And adding these bounds to the
KeySet
definition seems excessive, given that not all users need them.Any tips?
Thanks!