I have not looked all that closely at what the sample
functions are supposed to do to be honest. Does this summarize it?
The new seq::sample* functions and Robert Floyd's combination algorithm (https://github.com/rust-lang-nursery/rand/pull/144) are supposed to pick values from a range without returning any duplicates. The problem is that all previous results have to be kept in memory and checked for collisions.
Is there really not a better solution? It should be possible to adapt RNGs like PCG (but really anything LCG- or MCG-based) to work on a certain number of bits. What if we use an RNG with just enough bits to be able to generate all possible values, and reject the ones outside the range? Then all values are guaranteed to occur only once, and no vector with many comparisons or a more complex hashmap is necessary. And no dependency on alloc or std.
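To make the idea concrete, a rough sketch with placeholder constants (this is not rand API): an LCG modulo 2^bits has full period when the multiplier is ≡ 1 (mod 4) and the increment is odd, so it visits every bits-wide value exactly once per cycle, and rejection then needs no collision checking at all. Collecting into a Vec here is only for illustration.

// Yield each value of 0..range_len exactly once, in a pseudo-random order.
// Assumes 1 <= range_len < 2^63; the constants are placeholders.
fn unique_values_in_range(range_len: u64, seed: u64) -> Vec<u64> {
    let bits = 64 - (range_len - 1).leading_zeros(); // smallest width covering the range
    let mask = (1u64 << bits) - 1;
    let mut state = seed & mask;
    let mut out = Vec::with_capacity(range_len as usize);
    while out.len() < range_len as usize {
        // Full-period LCG step on `bits` bits: multiplier ≡ 1 (mod 4), odd increment.
        state = state.wrapping_mul(0x5DEECE66D).wrapping_add(11) & mask;
        if state < range_len {
            out.push(state); // cannot have been seen before within one period
        }
    }
    out
}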
This has the disadvantage of not working with a user-supplied RNG. And maybe I am missing something big? cc @vitiral, @burdges
maybe I am missing something big?
:smile: Realized it just after commenting. If we have to pick values from, say, 0..800, we need 10 bits. That would also mean the seed of the RNG is 10 bits. Then the RNG can never generate more than 2^10 = 1024 different orderings, while there are 800! possible orderings, which is basically infinite in my book. So that is an argument not to use this method, but rather one that can utilize an RNG with a larger period. With an RNG with a period of 2^64 or 2^128 it is at least impossible to exhaustively list all possible orderings.
That's some out-of-the-box thinking! I guess your custom sampling RNG can be seeded from the provided RNG, so your idea could work, if there is a way to reliably generate a high-quality random-hop RNG visiting all nodes of the given range exactly once in random-like order. That sounds hard to me!
Mostly though I made this issue in the hope that someone can design a nice API tying all the above things together.
Mostly though I made this issue in the hope that someone can design a nice API tying all the above things together.
But that is much more difficult! :smile:.
I see WeightedChoice as a somewhat odd functionality that not many will use. Do you think there is a place in rand for functions like this, that implement some smart algorithm to solve a not entirely uncommon problem?
Just today I found out about this nice problem of picking a random point on a sphere: http://mathworld.wolfram.com/SpherePointPicking.html. This is the same problem as picking a direction in 3D, which comes up in games.
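For illustration, a sketch of one of the methods from the linked page (Marsaglia, 1972), using nothing but a uniform source; this is not an existing rand API:

use rand::Rng;

// Pick (x1, x2) uniformly in the unit disk by rejection, then map to the sphere.
fn random_sphere_point<R: Rng>(rng: &mut R) -> (f64, f64, f64) {
    loop {
        let x1: f64 = rng.gen_range(-1.0, 1.0);
        let x2: f64 = rng.gen_range(-1.0, 1.0);
        let s = x1 * x1 + x2 * x2;
        if s >= 1.0 { continue; } // reject points outside the unit disk
        let f = (1.0 - s).sqrt();
        return (2.0 * x1 * f, 2.0 * x2 * f, 1.0 - 2.0 * s);
    }
}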
Would it make sense to add a module collecting stuff like this?
They may use the Distribution trait if that is convenient, but I don't see that as a reason to live in the distributions module.
I see WeightedChoice as a somewhat odd functionality that not many will use.
That was my thought too; removing it is definitely an option.
A PRNG can only generate unique values in a range with high probability if the range is much larger than the number of values requested.
I noticed sample_slice incurs a potentially undesirable allocation. At least sample_slice_ref should avoid allocation by converting all the indices to references directly via something like:
use std::mem;

fn index_to_ref<'a, T>(slice: &'a [T], index: &'a mut [usize]) -> &'a [&'a T] {
    let s = slice.as_ptr();
    let l = slice.len();
    for i in index.iter_mut() {
        assert!(*i < l);
        // Overwrite each index with the address of the corresponding element.
        *i = unsafe { s.offset(*i as isize) } as usize;
    }
    // I dunno if this `mem::transmute` call could/should be replaced with some
    // more idiomatic raw pointer call.
    unsafe { mem::transmute::<&mut [usize], &[&T]>(index) }
}
// There is no safe linear time way to implement the mutable variant
//   fn index_to_mut<'a,T>(slice: &'a mut [T], index: &'a mut [usize]) -> &'a mut [&'a mut T]
// but unsafely index_to_ref could be coerced into doing the mutable version for us.
Isn't there a use case for returning the second half of a Fisher-Yates shuffle applied to permute the first half? I think it gives you a faster combination in some situations: https://github.com/rust-lang-nursery/rand/issues/169#issuecomment-326428197
In principle, we could provide no_std variants that take in this user-supplied buffer, and call them for the -> Vec<T> versions. At least Robert Floyd's combination algorithm works:
use rand::Rng;

fn combination_slice<R: Rng>(rng: &mut R, range: std::ops::Range<usize>, s: &mut [usize]) {
    // https://github.com/jonas-schievink/slicevec
    let mut s = ::slicevec::SliceVec::new(s);
    let amount = s.capacity();
    let n = range.end;
    let m = std::cmp::min(amount, n - 1);
    // Should use #![feature(inclusive_range_syntax)] here in future.
    for j in n - m + 1..n + 1 {
        let t = rng.gen_range(range.start, j);
        let t = if s.contains(&t) { j - 1 } else { t };
        let _ = s.push(t);
    }
}

fn combination_vec<R: Rng>(rng: &mut R, range: std::ops::Range<usize>, amount: usize) -> Vec<usize> {
    let mut s = Vec::with_capacity(amount);
    unsafe { s.set_len(amount) };
    combination_slice(rng, range, s.as_mut_slice());
    s
}

// With (not yet existing) const generics:
fn combination_array<R: Rng, const AMOUNT: usize>(rng: &mut R, range: std::ops::Range<usize>) -> [usize; AMOUNT] {
    let mut s = [0usize; AMOUNT];
    combination_slice(rng, range, &mut s);
    s
}

// Deprecated form, using the arrayvec crate:
fn combination_array<R: Rng, A: Array<Item = usize>>(rng: &mut R, range: std::ops::Range<usize>) -> A {
    let mut s = ArrayVec::<A>::new();
    // The ArrayVec must be grown to its full capacity before it can be
    // passed on as a slice.
    for _ in 0..s.capacity() { let _ = s.push(0); }
    combination_slice(rng, range, s.as_mut_slice());
    s.into_inner().ok().unwrap()
}
It becomes tricky with T: Drop types of course, but SliceVec seemingly handles them, albeit with unnecessary drops. I suppose allocators might become the preferred mechanism here.
I suppose sample_indices_cache is kinda meant to integrate Robert Floyd's combination algorithm's benefits while avoiding the amount^2 / 2 comparisons by using a HashMap? It's "running backwards" so that you can output from the main loop though? Assuming that's done correctly, then sample_indices_cache could avoid allocating entirely by becoming an Iterator itself, no?
use std::collections::HashMap;
use std::mem;
use rand::Rng;

pub struct SampleIndicesCache<'a, R: Rng + 'a> {
    rng: &'a mut R,
    length: usize, // or Range<usize>
    cache: HashMap<usize, usize>,
    count: usize,
}

fn sample_indices_cache<'a, R: Rng>(rng: &'a mut R, length: usize, amount: usize)
    -> SampleIndicesCache<'a, R>
{
    let cache = HashMap::with_capacity(amount);
    SampleIndicesCache { rng, length, cache, count: 0 }
}

impl<'a, R: Rng + 'a> Iterator for SampleIndicesCache<'a, R> {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        let i = self.count;
        if i >= self.length { return None; }
        self.count += 1;
        let j: usize = self.rng.gen_range(i, self.length);
        let tmp = self.cache.remove(&i).unwrap_or(i); // equiv: let tmp = slice[i];
        if i == j { return Some(tmp); } // avoid reinserting or botching the return
        // let x = self.cache.remove(&j).unwrap_or(j); // equiv: slice[i] = slice[j];
        // self.cache.insert(j, tmp);                  // equiv: slice[j] = tmp;
        // It's faster than the previous two lines if we use the Entry API:
        Some(mem::replace(self.cache.entry(j).or_insert(j), tmp))
        // note that in the in-place version, slice[i] is automatically the "returned" value
    }
}
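Hypothetical usage of the sketch above: lazily draw five distinct indices from 0..100, touching only the HashMap instead of allocating a full permutation.

let mut rng = rand::thread_rng();
let indices: Vec<usize> = sample_indices_cache(&mut rng, 100, 5).take(5).collect();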
After writing this, there was seemingly a mistake in sample_indices_cache in which j comes up twice and i == j the second time, but an early return fixes it.
I'm still vaguely uncomfortable doing all this only for usize too. Anyone want these for u64 on a 32 bit platform?
@burdges I'm not really sure about the rest of your issue, but generic sample_indices is covered in rust-lang-nursery/rand#202
We might as well provide the ref and mut iterators as well:
pub struct SampleRefsIter<'a, T: 'a, R: Rng + 'a> {
    slice: &'a [T],
    indices: SampleIndicesCache<'a, R>,
}

fn sample_refs_iter<'a, T, R: Rng>(rng: &'a mut R, slice: &'a [T]) -> SampleRefsIter<'a, T, R> {
    SampleRefsIter { slice, indices: sample_indices_cache(rng, slice.len(), slice.len()) }
}

impl<'a, T, R: Rng> Iterator for SampleRefsIter<'a, T, R> {
    type Item = &'a T;
    fn next(&mut self) -> Option<&'a T> {
        let slice = self.slice;
        self.indices.next().map(|x| {
            assert!(x < slice.len());
            unsafe { &*slice.as_ptr().offset(x as isize) }
        })
    }
}

pub struct SampleMutsIter<'a, T: 'a, R: Rng + 'a>(SampleRefsIter<'a, T, R>);

fn sample_muts_iter<'a, T, R: Rng>(rng: &'a mut R, slice: &'a mut [T]) -> SampleMutsIter<'a, T, R> {
    SampleMutsIter(sample_refs_iter(rng, slice))
}

impl<'a, T, R: Rng> Iterator for SampleMutsIter<'a, T, R> {
    type Item = &'a mut T;
    fn next(&mut self) -> Option<&'a mut T> {
        // In principle, this should be safe because the sample cannot return the same value
        // twice, but `mem::transmute` should be replaced with more idiomatic raw pointer calls,
        // and worse, this makes memory safety dependent upon the correctness of `HashMap`.
        // I proposed this approach to remain compatible with other mutable iterators, ala
        // https://doc.rust-lang.org/src/core/slice/mod.rs.html#1422-1426
        // but a bound like `where Self: 'a` might insulate us from `HashMap`, at the cost of
        // some usability: references could not be kept between iterations.
        self.0.next().map(|x| unsafe { mem::transmute::<&T, &mut T>(x) })
    }
}
We have roughly this functionality already; some of it is currently in Rng (see https://github.com/rust-lang-nursery/rand/issues/293), some in other places.
Rough proposal: add a trait something like the following, with impls for both iterators and slices (optimised to each appropriately):
pub trait SampleSeq<T> {
    type Item;
    // as the current seq::sample:
    fn sample<R: Rng + ?Sized>(self, n: usize, rng: &mut R) -> Result<Vec<T>, Vec<T>>;
    // as above, but with key extraction per element:
    fn sample_weighted<R: Rng + ?Sized, F: Fn(&Self::Item) -> f64>(self, weight_fn: F, n: usize, rng: &mut R) -> Result<Vec<T>, Vec<T>>;
    // sample but with n == 1:
    fn choose<R: Rng + ?Sized>(self, rng: &mut R) -> Option<T>;
}

pub fn sample_indices<R: Rng + ?Sized>(rng: &mut R, length: usize, n: usize) -> Vec<usize>;
A trait that can be implemented for both slices and iterators seems like a nice idea. But I don't follow it completely.
I am missing sample_slice_ref, but that is hard to implement for iterators?
Is combining choose with iterators not a very strange scenario? Do you have an example of what code using that would look like?
And I am not sure what sample_weighted does...
You might be right about sample_slice_ref. I wonder if we could just template over both the input type and the output type, i.e. SampleSeq<&[i32], &i32>, or possibly not.
Choose on iterators (maybe it is strange): https://github.com/rust-lang-nursery/rand/pull/198
I don't know; this needs more thought; I was just mentioning an idea to explore later.
Which nice algorithms do we have? I have excluded WeightedChoice here; that seems a problem by itself... Just diving in to figure out the possibilities here.
shuffle: Fisher-Yates shuffle

Shuffle the elements of a mutable slice.

choose: pick one element from a slice

Simply uses gen_range under the hood. Repeatedly calling this method can sometimes return the same element as before.
We could return a copy/clone of the element, a reference or a mutable reference. I would say the last two are the most useful, giving choose and choose_mut as methods, as we have now.

choose_multiple(n): pick multiple unique elements from a slice

Uses Robert Floyd's algorithm (https://github.com/rust-lang-nursery/rand/pull/144). The order of the picked elements may not be random, but which elements are picked is. To also have a random order, call .shuffle() on the result.
This will either have to return a vector, or have to fill a slice with the results (something to do to support no_std).
During its operation the algorithm needs to check whether the resulting vector already contains a picked value. That means all the elements need to be unique for the method to work. So this method should not be used on the elements of slices directly, but only on a range of integers (indices) or on references to the elements.
It seems to me there are two nice ways to implement this algorithm. One as in the PR: the basic combination algorithm, only operating on a range of integers. The resulting list can also be used as indices into a slice. Maybe call it something like choose_from_range(range, n). Or again not use an n amount, but a slice to fill with numbers, choose_from_range(range, &mut slice).
The second could work on a slice directly by creating and comparing references. This would fill a slice with references to the source slice, but I am not sure this works out with lifetimes...
b.t.w. currently seq::sample_indices_cache uses a hashmap, but I think it can better use Robert Floyd's algorithm.
partial_shuffle(n)

In https://github.com/rust-lang-nursery/rand/issues/169#issuecomment-326428197 a nice design for a function doing a partial Fisher-Yates shuffle came up.
This is an alternative to Robert Floyd's algorithm with a different trade-off: Robert Floyd's algorithm does many comparisons (on the order of n^2), and the order of the elements is not really random. A partial Fisher-Yates shuffle mutates the source slice, but does not need to do comparisons, and the order of the elements is random. Because it mutates the source slice, the slice may have to be cloned before calling this method. choose_multiple(n) is fastest for a relatively small number of elements, partial_shuffle(n) for a relatively large number.
The returned values are two sub-slices of the source slice: one with the partially shuffled elements, one 'scratch space' with the remaining elements.
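A minimal sketch of how such a partial Fisher-Yates shuffle could look (the forward variant; the linked comment discusses the details):

fn partial_shuffle<'a, T, R: Rng>(slice: &'a mut [T], rng: &mut R, amount: usize)
    -> (&'a mut [T], &'a mut [T])
{
    let len = slice.len();
    let amount = std::cmp::min(amount, len);
    for i in 0..amount {
        // Standard Fisher-Yates step, stopped after `amount` swaps.
        let j = rng.gen_range(i, len);
        slice.swap(i, j);
    }
    // First part: `amount` randomly chosen, randomly ordered elements;
    // second part: the remaining 'scratch space'.
    slice.split_at_mut(amount)
}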
The current sample_slice, sample_slice_ref and sample_iter can be built with choose_from_range(range, n) and partial_shuffle(n). They may pick the best method and do temporary allocations.
Is it possible to implement methods for iterators similar to those for slices?
shuffle: inside-out Fisher-Yates shuffle

Collect and shuffle the elements of an iterator into a vector. I found no better reference than Wikipedia. We may have to get a little creative to adapt it to Rust and avoid reading uninitialized memory.
This might be slightly faster than .collect().shuffle(), but probably not. We would be up against the standard library :smile:, and a specialized collect implementation.
Maybe confusing and not worth it.
choose_multiple(n): pick multiple unique elements from an iterator

Algorithms that can sample from an unknown number of iterations, and that only go over all the iterations once and sequentially, are known as reservoir sampling algorithms. Algorithm Z from Random Sampling with a Reservoir by Jeffrey Vitter seems to be state of the art here.
Note that the goal here is to only go over the iteration once, not to minimize the number of RNG calls (as with Robert Floyd's algorithm that works on slices).
The current sample_iter uses a much simpler algorithm than Vitter's (it needs to generate a random number for every iteration), but is otherwise similar to choose_multiple(n).
I think returning Err<...> if the iterator turns out to have less than n elements, as sample_iter does, is a good idea.
choose: pick one element from an iterator

This would just be a specialization of choose_multiple(n) to only pick one element. A faster alternative to https://github.com/rust-lang-nursery/rand/pull/198.
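Such a specialization could be single-element reservoir sampling, sketched here (hypothetical, not the linked PR's code): keep the i-th element with probability 1/(i + 1), which leaves every element equally likely.

fn choose_from_iter<I: Iterator, R: Rng>(iter: I, rng: &mut R) -> Option<I::Item> {
    let mut result = None;
    for (i, item) in iter.enumerate() {
        // Replace the current pick with probability 1/(i + 1).
        if rng.gen_range(0, i + 1) == 0 {
            result = Some(item);
        }
    }
    result
}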
choose_one_in_n(n)

On average pick every 1 in n elements. Suppose n = 10. This may pick two elements right after each other, or skip 40 elements, as long as each occurs with the 'right' probability and the average is 1 in 10. Not thought this through yet, to be honest...
Edit: this is the only method in this list that can be implemented as an iterator itself.
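To make the 'iterator itself' point concrete, a hypothetical adapter; a smarter version would sample the geometric skip length directly instead of rolling the RNG for every element:

struct OneInN<I, R> {
    iter: I,
    rng: R,
    n: usize,
}

impl<I: Iterator, R: Rng> Iterator for OneInN<I, R> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        loop {
            let item = self.iter.next()?;
            // Keep each element independently with probability 1/n.
            if self.rng.gen_range(0, self.n) == 0 {
                return Some(item);
            }
        }
    }
}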
(Or how I think it would look, modeled after core::SliceExt, core::Iterator::last and Iterator::collect, to the best of my abilities...)
pub trait SliceRandom {
    type Item;
    fn choose<R: Rng + ?Sized>(&self, rng: &mut R) -> Option<&Self::Item>;
    fn choose_mut<R: Rng + ?Sized>(&mut self, rng: &mut R)
        -> Option<&mut Self::Item>;
    fn shuffle<R: Rng + ?Sized>(&mut self, rng: &mut R);
    fn choose_multiple<R: Rng + ?Sized>(&self, rng: &mut R, amount: usize)
        -> Vec<Self::Item>;
    // FIXME: requires Self::Item is Clone
    fn choose_multiple_ref<R: Rng + ?Sized>(&self, rng: &mut R, amount: usize)
        -> Vec<&Self::Item>;
    // FIXME: needs to couple the lifetimes
    fn partial_shuffle<R: Rng + ?Sized>(&mut self, rng: &mut R, amount: usize)
        -> (&mut [Self::Item], &mut [Self::Item]);
}

impl<T> SliceRandom for [T] {
    type Item = T;
    /* ... */
}
pub trait IteratorRandom: Iterator {
    fn choose<R>(self, rng: &mut R) -> Option<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }

    fn shuffle<R>(self, rng: &mut R) -> Vec<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }

    fn choose_multiple<R>(self, rng: &mut R, amount: usize)
        -> Result<Vec<Self::Item>, Vec<Self::Item>>
        where R: Rng + ?Sized
    { /* ... */ }

    fn choose_one_in_n<R>(self, rng: &mut R, n: usize) -> Vec<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }
}

pub fn choose_from_range(/* ... */);
An interesting alternative to Robert Floyd's algorithm is Jeffrey Vitter's Efficient Algorithm for Sequential Random Sampling, Method D.
It works by going over a list once, picking an element, and then skipping multiple elements with just the right probability. This makes its running time linear, as opposed to Robert Floyd's algorithm, which has to do on the order of amount^2 / 2 comparisons. So there will be a tipping point where one algorithm is faster than the other.
What makes it interesting/unique API-wise is that it can be used as an iterator over a slice. (Not as an adapter to other iterators though, as it requires the number of elements to be known from the start.) Could be part of the methods on a slice as choose_multiple_iter(n).
How does Method D differ from Jeffrey Vitter's Algorithm R used in seq::sample_iter?
Algorithm Z does not need to know the number of elements it can sample from. It will first fill the destination slice with many items, and then at a decreasing rate start replacing them with elements from further on in the source slice.
Method D needs to know the number of elements it can sample from. It will pick a random interval and copy a first element. Then, based on the number of remaining elements, it will calculate an interval with the appropriate probability to skip, and then sample a second element. Then a third, until exactly the required number of elements is reached, roughly at the end of the source slice.
So Algorithm Z will do many copies (some logarithm of n), Method D will do only n copies. Method D will need to know the number of elements in the source slice, Algorithm Z not. At least, that is my understanding without touching any code :smile:.
How does Method D differ from Jeffrey Vitter's Algorithm R used in seq::sample_iter?
Oh, I am sorry, I didn't read your question closely enough and confused some things. Algorithm Z is supposed to be an improved Algorithm R, even said to be optimal. But thank you for naming it, I was still searching for which one was used there :+1:.
R has to calculate a random value in a range for every element, Z only for some logarithm of them: 3n * ln(N/n) according to the paper, with N the total number of elements and n the amount to sample.
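For reference, a minimal sketch of algorithm R as described (fill the reservoir first, then give element number t an n/t chance to replace a random slot); my own illustration, not the sample_iter code:

fn reservoir_r<I, R>(mut iter: I, rng: &mut R, amount: usize)
    -> Result<Vec<I::Item>, Vec<I::Item>>
    where I: Iterator, R: Rng
{
    let mut reservoir: Vec<I::Item> = iter.by_ref().take(amount).collect();
    if reservoir.len() < amount {
        return Err(reservoir); // the iterator had fewer than `amount` elements
    }
    for (i, item) in iter.enumerate() {
        // One RNG call per remaining element; this is the cost Z avoids.
        let j = rng.gen_range(0, amount + i + 1);
        if j < amount {
            reservoir[j] = item;
        }
    }
    Ok(reservoir)
}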
Some random collected thoughts:
- I now think implementing shuffle for iterators is really not worth it. It is a bit confusing, less flexible, and probably not even faster than the alternative.
- I don't think returning a result type for choose_multiple(n), so an iterator can return an error if it has less than n elements, is worth it. Checking the length of the Vec is just as good an indication that there are fewer elements. And is returning less when there is simply nothing more available not just how APIs that work with iterators operate right now? For example, .take(5).collect() does not guarantee you end up with 5 elements.
- If we reverse choose_multiple(n), we can make it workable with no_std and not do any allocations: fill a slice of n elements from a larger slice, or even from an iterator.
fn choose_multiple_from<R: Rng + ?Sized>(&self,
                                         rng: &mut R,
                                         source: &[Self::Item],
                                         amount: usize);
- What should partial_shuffle(amount) do if amount is greater than the number of elements in the slice? I suppose it should simply shuffle the entire slice, not return some option or error.
- Is choose_multiple_ref a function worth having? Returning a vector of references feels strange to me. Isn't it much better to either generate a vector of indices, or to use the iterator variant with .by_ref()?
- It seems to me reservoir sampling from an iterator can be parallelized using Rayon. Now I don't know much about Rayon. Every subdivision needs to sample into its own reservoir of n elements. When joining, one reservoir should be formed by sampling from the two. Not thought this through, but maybe worth exploring.
pub trait SliceRandom {
    type Item;

    /// Returns a reference to one random element of the slice, or `None` if the
    /// slice is empty.
    fn choose<R>(&self, rng: &mut R) -> Option<&Self::Item>
        where R: Rng + ?Sized;

    /// Returns a mutable reference to one random element of the slice, or
    /// `None` if the slice is empty.
    fn choose_mut<R>(&mut self, rng: &mut R) -> Option<&mut Self::Item>
        where R: Rng + ?Sized;

    /// Shuffle a slice in place.
    ///
    /// This applies Durstenfeld's algorithm for the [Fisher–Yates shuffle](
    /// https://wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle), which produces
    /// an unbiased permutation.
    fn shuffle<R>(&mut self, rng: &mut R) where R: Rng + ?Sized;

    /// Shuffle a slice in place, but exit early.
    ///
    /// Returns two mutable slices from the source slice. The first contains
    /// `amount` elements randomly permuted. The second has the remaining
    /// elements that are not fully shuffled.
    ///
    /// This is an efficient method to select `amount` elements at random from
    /// the slice, provided the slice may be mutated.
    ///
    /// If you only need to choose elements randomly and `amount > self.len()/2`
    /// then you may improve performance by taking
    /// `amount = values.len() - amount` and using only the second slice.
    ///
    /// If `amount` is greater than the number of elements in the slice, this
    /// will perform a full shuffle.
    fn partial_shuffle<R>(&mut self, rng: &mut R, amount: usize)
        -> (&mut [Self::Item], &mut [Self::Item]) where R: Rng + ?Sized;

    /// Produces an iterator that chooses `amount` elements from the slice at
    /// random without repeating any.
    ///
    /// TODO: This may internally use Jeffrey Vitter's algorithm D,
    /// Robert Floyd's algorithm, or some other method.
    ///
    /// TODO: Do we want to guarantee the chosen elements are sequential?
    fn choose_multiple<R>(&self, rng: &mut R, amount: usize)
        -> ReservoirSampler<R> where R: Rng + ?Sized;

    /// Fill a mutable slice with elements chosen at random from a finite
    /// iterator.
    ///
    /// Uses Alan Waterman's algorithm R, or Jeffrey Vitter's algorithm Z.
    ///
    /// Note: only algorithm R is available if the `std` feature is not enabled,
    /// because algorithm Z needs `f64::log` and `f64::exp`.
    ///
    /// TODO: What to do if the iterator does not produce enough elements to
    /// fill the slice?
    fn choose_multiple_from<R, I>(&mut self, rng: &mut R, iterable: I)
        where R: Rng + ?Sized, I: IntoIterator<Item=Self::Item>;
}

impl<T> SliceRandom for [T] {
    type Item = T;
    /* ... */
}

impl<R> Iterator for ReservoirSampler<R> where R: Rng + ?Sized {
    /* ... */
}

pub trait IteratorRandom: Iterator {
    /// Choose one element at random from the iterator.
    ///
    /// FIXME: is it ever possible for an iterator to return no elements?
    fn choose<R>(self, rng: &mut R) -> Option<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }

    /// Collects `amount` values at random from the iterator into a vector.
    ///
    /// Uses Alan Waterman's algorithm R, or Jeffrey Vitter's algorithm Z.
    ///
    /// Note: only algorithm R is available if the `std` feature is not enabled,
    /// because algorithm Z needs `f64::log` and `f64::exp`.
    #[cfg(any(feature="std", feature="alloc"))]
    fn choose_multiple<R>(self, rng: &mut R, amount: usize) -> Vec<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }
}

/// Choose `amount` elements from `0..range` using [Robert Floyd's algorithm](
/// http://fermatslibrary.com/s/a-sample-of-brilliance).
///
/// It only makes `amount` calls to the random number generator instead of
/// `range` calls. It does `amount^2 / 2` extra comparisons however.
pub fn choose_from_range<R>(rng: &mut R, range: usize, amount: usize) -> Vec<usize>
    where R: Rng + ?Sized;
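For a feel of the proposal, hypothetical usage assuming the traits above are implemented for slices and iterators (none of this exists yet):

let mut rng = thread_rng();
let mut values = vec![1, 2, 3, 4, 5, 6, 7, 8];

values.shuffle(&mut rng);
let (chosen, _rest) = values.partial_shuffle(&mut rng, 3);
println!("three random elements: {:?}", chosen);

// And on an iterator:
let sample = (0..100).choose_multiple(&mut rng, 5);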
As it turns out, the iterator methods don't add much, but they may be ergonomic.
With partial_shuffle, choose_multiple, choose_multiple_from and choose_from_range we get four different algorithms to uniquely sample multiple elements, with different trade-offs. And we can do so from ranges, slices and iterators.
@dhardy What do you think of methods like this? I think they are a little less easy to use than the current methods in the seq module, but more flexible, and also usable without std.
(I am still working on weighted sampling, but that is an area with many different choices)
Uh, I was trying to ignore this until after 0.5; can it wait at least until we decide exactly what needs to be done before the 0.5 release?
For uniform sampling we had two primary methods: sampling with replacement, and sampling without replacement. With replacement can be implemented most naturally with a method that just picks one result, choose. Without replacement needs to know how many elements to pick, with functions such as choose_multiple.
For weighted sampling we have three methods:
How to think about the difference between 'with replacement' and 'without replacement'? Imagine a population of 10 items, the first with a weight 10.0, the rest with weight 1.0. When picking 4 elements, do you expect
The linked paper has two examples to illustrate the difference between 'defined probabilities' and 'defined weights'. One way that helped me to think about it:
From the paper:
Example 1. Assume that we want to select a weighted random sample of size m = 2 from a population of n = 4 items with weights 1, 1, 1 and 2, respectively. For problem WRS-N-P the probability of items 1, 2 and 3 to be in the random sample is 0.4, whereas the probability of item 4 is 0.8. For WRS-N-W the probability of items 1, 2 and 3 to be in the random sample is 0.433, while the probability of item 4 is 0.7.
The algorithm for reservoir sampling with defined probabilities is the method by M. T. Chao from 1982 A general purpose unequal probability sampling plan. And the algorithm using defined weights is by Efraimidis and Spirakis from 2005 Weighted Random Sampling.
Further there is one constant for all weighted sampling methods: they all have to iterate over the entire slice(/iterator) at least once, because the sum of all the probabilities must be known.
For weighted sampling with replacement we may use simpler methods. For example, it is possible to do a one-time set-up that calculates the sum of all weights. Then generate a random number in that range, and iterate over all weights until the sum is greater than the random value. (Somewhat similar to the simple choose method for uniform sampling.) This can be optimised by pre-computing all the partial sums of the weights. With WeightedChoice we currently have a simple implementation of this method that uses integers.
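A sketch of that with-replacement method with the pre-computed sums (illustrative names and f64 weights, not the integer-based WeightedChoice): one O(n) setup, then O(log n) per sample via binary search.

fn choose_weighted<'a, T, R: Rng>(items: &'a [T], weights: &[f64], rng: &mut R) -> &'a T {
    assert_eq!(items.len(), weights.len());
    // One-time setup: running totals of the weights.
    let cumulative: Vec<f64> = weights.iter()
        .scan(0.0, |sum, &w| { *sum += w; Some(*sum) })
        .collect();
    let total = *cumulative.last().expect("empty slice");
    let x = rng.gen_range(0.0, total);
    // Find the first cumulative weight greater than x.
    let idx = match cumulative.binary_search_by(|c| c.partial_cmp(&x).unwrap()) {
        Ok(i) | Err(i) => i,
    };
    &items[idx]
}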
This just provides an overview of the involved choices. The real choice of algorithms, how to 'bind' weights to elements, and what API to present is left to explore.
And something else, that seems somewhat related to weighted sampling:
It is possible to sample n elements by taking a constant interval sum/n and only a random starting point. This is known as stochastic universal sampling, and is implemented in the Rust Random Choice crate. This picture illustrates how it works. In theory it can be implemented as an efficient iterator that returns n elements from a slice. There are situations where this method can be the right choice, but it is very different from the weighted random sampling discussed here.
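An illustrative sketch of stochastic universal sampling (my own, not the Random Choice crate's code): one random start, then n equally spaced pointers over the running weight totals.

fn stochastic_universal<'a, T, R: Rng>(items: &'a [(T, f64)], n: usize, rng: &mut R) -> Vec<&'a T> {
    let total: f64 = items.iter().map(|&(_, w)| w).sum();
    let step = total / n as f64;
    let mut pointer = rng.gen_range(0.0, step); // the only RNG call
    let mut chosen = Vec::with_capacity(n);
    let mut cumulative = 0.0;
    for &(ref item, w) in items {
        cumulative += w;
        // An element is picked once for every pointer landing in its segment.
        while chosen.len() < n && pointer < cumulative {
            chosen.push(item);
            pointer += step;
        }
    }
    chosen
}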
The more you read, the more improvements to the algorithms you find. But I think I now have collected the current 'best' for uniform sampling. People seem very creative when coming up with algorithm names...
It began with Knuth in Art of Computer Programming vol. II, where he used the names algorithm S for a selection sampling technique, and algorithm R for reservoir sampling.
Selection sampling

Algorithm S is attributed by Knuth to C. T. Fan, Mervin E. Muller, and Ivan Rezucha, and was also independently discovered by T. G. Jones (1962).
Robert Floyd improved upon it with algorithm F, and also came up with a variant that guarantees the collected samples are permuted, algorithm P. (A Sample of Brilliance, 1987)
Algorithm S is usually faster than F, because it is not necessary to recompute the acceptable zone for every changing range. But S performs very badly when the number of elements to sample gets close to the number of available elements. So it is probably best to switch between both: S when the number of items to sample is less than something like 25%, and P otherwise.
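For reference, my sketch of algorithm S: visit each index once, accepting it with probability (still needed) / (still available).

fn algorithm_s<R: Rng>(rng: &mut R, n: usize, amount: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(amount);
    let mut needed = amount;
    for i in 0..n {
        // Accept index i with probability needed / (n - i).
        if rng.gen_range(0, n - i) < needed {
            out.push(i);
            needed -= 1;
            if needed == 0 { break; }
        }
    }
    out
}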
Reservoir sampling

Algorithm R is attributed to Alan Waterman.
Jeffrey Vitter improved upon it with algorithms X, Y and Z, which need fewer values from the RNG. (Random Sampling with a Reservoir, 1985)
Kim-Hung Li introduced algorithms K, L, and M, which are similar to algorithm Z but make different trade-offs. Of the four, L seems to be the fastest choice for most situations. (Reservoir-Sampling Algorithms of Time Complexity O(n(1 + log(N/n))), 1994)
Sequential random sampling

In Sequential random sampling (1985), J. H. Ahrens and U. Dieter showed several variants of algorithm S. They renamed S to SU, and introduced SG and SG* as geometric methods, and SH as a method employing hashing techniques. They all have a small disadvantage, in that they need more memory than the reservoir size.
Jeffrey Vitter introduced algorithms A and D (Edit: and B and C in-between) to efficiently sample from a known number of elements sequentially, without needing extra memory. (Faster Methods for Random Sampling, 1984 and Efficient Algorithm for Sequential Random Sampling, 1987)
K. Aiyappan Nair improved upon it with algorithm E. (An Improved Algorithm for Ordered Sequential Random Sampling, 1990)
As I understand it, algorithm D falls back on A if that is faster, and E falls back to A in more situations. So E is the best choice. E is expected to have similar performance to SG*.
So we have two methods for weighted random sampling without replacement. In the literature they are known as A-ES and A-Chao. They have a subtle difference, in that one takes the weights as relative to the total weight of all items ('using defined weights'), and the other as relative to the remaining items ('using defined probabilities'). Should we support both in Rand?
A few points to consider:
I think we can say that for most users, A-ES is the algorithm they expect.
Do we want to explain the difference, and provide users with a choice? I think a small explanation is good, together with naming both algorithms. But we do not have to include every possible algorithm in Rand. There is no problem implementing A-Chao in another crate. Weighted sampling without replacement is good to have, yet including two with subtle differences seems like too much.
I would suggest Rand only implements A-ES.
Implemented Vitter's method D today (although there might still be a couple of bugs in it...). At least enough to see how it works and performs.
Jeffrey Vitter produced two papers about it. One in 1984 that details the theory behind it, and all the possibilities for optimization. Very handy :smile:. The second paper from 1987 is much shorter and does not add all that much. I translated the Pascal-like code from the second paper to Rust.
The loops map very cleanly to an iterator implementation in Rust. Also our exponential distribution (Ziggurat method) considerably improved performance. But because method D has to use ln() and exp() a couple of times, it is not possible to use with just core, only with std. Edit: Well, the full method D is not available, but the often a bit slower method A can be used with no_std.
Performance is very comparable to the current sample_slice and sample_slice_ref. Sometimes a bit faster, sometimes a bit slower, usually within 10%. I was partly disappointed by the performance, but another way to look at it is that the current implementation of sample_slice is pretty good.
Method D has a few small advantages:
pub trait SliceRandom {
    type Item;
    /* ... */

    /// Produces an iterator that chooses `amount` elements from the slice at
    /// random without repeating any, in sequential order.
    fn pick_multiple<'a, R: Rng>(&'a self, amount: usize, rng: &'a mut R)
        -> RandomSampler<'a, Self::Item, R>;
    // TODO: probably a pick_multiple_mut can also be useful
}

pub struct RandomSampler<'a, T: 'a, R: Rng + 'a> {
    /* ... */
}

impl<'a, T: 'a, R: Rng + 'a> Iterator for RandomSampler<'a, T, R> {
    type Item = &'a T;
    fn next(&mut self) -> Option<&'a T> {
        /* ... */
    }
}

// Example of using pick_multiple to sample 20 values from a slice with 100 integers:
let mut r = thread_rng();
let vals = (0..100).collect::<Vec<i32>>();
let small_sample: Vec<_> = vals.pick_multiple(20, &mut r).collect();
println!("{:?}", small_sample);
I also tried to implement the suggested improvements in the paper from Nair (1990) as algorithm E. He claims that by doing an extra calculation, it is possible to run the relatively slow method D less often. And that is true. His method calculates a lower and an upper bound, and if they are the same, no extra calculations are necessary.
But the extra work to compute the bounds for every iteration is much more work! His paper does not include any measurements. It actually makes the algorithm ~50% slower. (Edit: I may of course have messed up the implementation, but I think not, because it is only a couple of straightforward lines.) I also see no way his proposed improvement to algorithm A can make it any faster. Both were nice insights, but not practical.
What's the current status here? I'm quite interested to jump in and help.
Is it still up for discussion which algorithms to implement and what API to use?
What's the current status here? I'm quite interested to jump in and help.
Great! There is not much more current status than what you see here. Mostly just my explorations from two months ago. Now that the 0.5 release is around the corner, we can start looking again at new features.
Is it still up for discussion which algorithms to implement and what API to use?
Yes, completely. What do you think of my proposal for an API in https://github.com/dhardy/rand/issues/82#issuecomment-373927205? My effort at implementing it (still in rough shape) lives in https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs.
Weighted sampling is still a bit of an open problem. I was thinking that an API taking two iterators, one with the data and one with the weights, would be most flexible, but don't have anything concrete yet.
Feel free to come up with any different ideas, or take code, as you want.
Cool!
I think that API for weighted sampling makes a lot of sense. I imagine something like (with one of these arguments passed as self):
fn pick_weighted<IterItems, IterWeights, R>(items: IterItems,
                                            weights: IterWeights,
                                            rng: &mut R)
    -> IterItems::Item where IterItems: Iterator,
                             IterWeights: Iterator + Clone,
                             IterWeights::Item: SampleUniform + From<u8> + Sum<IterWeights::Item>,
                             R: Rng
{
    let total_weight: IterWeights::Item = weights.clone().sum();
    pick_weighted_with_total(items, weights, total_weight, rng)
}

fn pick_weighted_with_total<IterItems, IterWeights, R>(items: IterItems,
                                                       weights: IterWeights,
                                                       total_weight: IterWeights::Item,
                                                       rng: &mut R)
    -> IterItems::Item where IterItems: Iterator,
                             IterWeights: Iterator,
                             IterWeights::Item: SampleUniform + From<u8>,
                             R: Rng
{
    ...
}
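My guess at how the elided body could look, specialised to f64 weights for readability: draw a value below the total, then walk the weights until the running sum passes it.

fn pick_weighted_with_total_f64<I, W, R>(mut items: I, weights: W,
                                         total_weight: f64, rng: &mut R)
    -> I::Item where I: Iterator, W: Iterator<Item = f64>, R: Rng
{
    let x = rng.gen_range(0.0, total_weight);
    let mut running = 0.0;
    for w in weights {
        let item = items.next().expect("fewer items than weights");
        running += w;
        if x < running {
            return item;
        }
    }
    unreachable!("total_weight exceeded the sum of the weights")
}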
I'm happy to provide a patch for this.
On a separate, and more general, topic regarding API design: I'm trying to understand why the proposals tend to stick these functions on [T] and Iterator rather than on Rng? I.e. why vec.shuffle(rng) rather than rng.shuffle(vec)?
I certainly agree that sticking functions on existing types generally results in nicer APIs than creating global functions, i.e. vec.shuffle(rng) would be a lot nicer than shuffle(vec, rng). And in instances when there's no other object to tie an operation to, it certainly seems like the right thing to do, i.e. vec.shuffle() is a lot nicer than shuffle(vec).
However between vec.shuffle(rng) and rng.shuffle(vec) I see a few advantages with the latter:

- The Rng trait would show all operations that objects implementing Rng can be used for.
- When seeing vec.choose_multiple(rng, 10) on github or elsewhere, it's less clear where to look for documentation for the choose_multiple function.
- For functions like choose_weighted, which operates on two iterators, or choose_multiple_from, which operates on an iterator and a slice, it feels quite arbitrary which of the objects should be passed as self and which should be passed as a normal argument. Using the Rng as self can be done consistently.

But I admit I'm also strongly biased by coming from C++, which for obvious reasons doesn't have a tradition of libraries extending existing types.
And there are a couple of good arguments for the vec.shuffle(rng) style that I can think of:

- It allows having both iter.pick(rng) and vec.pick(rng), rather than having to use rng.pick(vec) and rng.pick_from_iter(iter).
- Chained calls read better: get_a_vector(foo).pick(rng).do_some_stuff() is cleaner than rng.pick(get_a_vector(foo)).do_some_stuff().

Are there any general rust guidelines around this? Should we leave this particular topic to a separate issue?
There's this https://rust-lang-nursery.github.io/api-guidelines/ idk how much is applicable though
I think I've roughly caught up on this, except for weighted sampling and not looking into all the algorithms...
I quite like @pitdicker's proposed API, except that I would move SliceRandom::choose_multiple_from to IteratorRandom::choose_multiple_fill.
One thing that stands out is that choose and choose_multiple are basically just wrappers around gen_range and sample_indices, so it doesn't matter much if they don't cover all uses perfectly (though it may still be worth having _mut versions). I think on iterators choose_multiple can just be a wrapper around choose_multiple_fill.
Another point is that with iterators, if the exact length is known, much more efficient implementations are possible. But I guess this situation is unlikely, so users can make their own wrappers around gen_range / sample_indices in this case.
Is it actually possible to implement partial_shuffle significantly more efficiently than with choose_multiple and shuffle? I saw the neat variant of Floyd's algorithm, but it requires insertions at arbitrary positions, which is generally not efficient.
SliceRandom::choose_multiple_iter sounds like an interesting extra, though if performance is good enough we could just use this instead of returning a Vec for choose_multiple. Note that the sample_indices approach should also map well to an iterator over output values.
A lot of this requires a good sample_indices function, and I see there are a lot of possible algorithms we could use (much more than rust-lang-nursery#479). It may or may not be worth generalising this (provided it doesn't compromise performance): allow a lower bound other than 0, and be generic over types (probably T: From<u8>).
choose_one_in_n: as I understand it, this is simply iter.filter(|_| rng.gen_bool(p)), hence does not appear an important addition.
I.e. why vec.shuffle(rng) rather than rng.shuffle(vec)?
Also something I was thinking about. Your rationales are important; a couple more:
- Method calls typically follow the data: for Rng::gen the data source is the RNG, but for choose and shuffle the source is the slice/iterator.

I think at least methods consuming iterators should put the methods on the iterator, i.e. iter.choose_multiple(&mut rng, n).
For slices I'm less sure. (We could even have both, as with Distribution::sample and Rng::sample, though probably this isn't desirable.) Of course having iter.choose(&mut rng) but rng.choose(slice) is a bit confusing for users, so consistency is probably the best option. This leaves the question of whether we should deprecate Rng::choose and Rng::shuffle.
Method calls typically follow the data...
Having noodled on this for a few days, I can certainly see logic in having Rng only have functions which generate values, and putting functions which operate on values in slices and iterators in extension traits for those types.
However that does leave APIs like choose_weighted which take two iterators. Randomly picking one of those iterators as self doesn't feel very nice. If choose_weighted (and choose_weighted_with_total) is the only instance of APIs that operate on multiple data sources, then ultimately it's probably ok that this one is a bit ugly. But of course it's nice if we make it as good as it can be.
But if we go with this approach I think it's important that we only stick functions like shuffle and choose on SliceRandom. If some functions which operate on slices live on Rng and some don't, then it's very easy to miss the ones that don't exist at all.
And I do still have concerns around discoverability and documentation...
If we go with this approach, I agree with @dhardy that we should do IteratorRandom::choose_multiple_fill rather than SliceRandom::choose_multiple_from.
FWIW, I'd really like to come to some sort of conclusion here. I'd enjoy writing up more patches for sequence-related functions, but don't want to waste my time on PRs that aren't going to go anywhere due to using the wrong API.
https://github.com/rust-lang-nursery/rand/pull/483 was my attempt at drawing some kind of conclusion from this. It's more complicated than it should be (I need to clean up the commits), but essentially just waiting for feedback on the basic design stuff.
@pitdicker: Is the code for sampling multiple values at https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs working? I.e. are RandomSampler and SequentialRandomSampler working? I wanted to try to benchmark it against other algorithms for multi-sampling.
Is the code for sampling multiple values at pitdicker/rand:src/seq.rs@slice_random working?
I have tested it, but only lightly. But I have re-checked multiple times that it is implemented exactly as in the paper.
b.t.w. @dhardy @sicking I have enjoyed working with you on rand, but it is a bit more difficult to find the time at the moment (as you probably have noticed). Feel free to cc me with @pitdicker on anything, and hopefully I can do more after the summer.
No problem @pitdicker; thank you for all the work!
Thanks for the work @pitdicker! I've enjoyed working with you a lot.
I'm actually also likely to have less time. I'm starting a new job next week so will have significantly less free time on my hands. I still have a to-do list of items that I want to get done, and hope to be able to get through it before the 1.0 release. So for now hopefully I'll still be able to stay involved.
I believe we've finally wrapped most of this up.
This part of rand has needed work for a while, and hasn't had much love recently, excepting the excellent work on seq::sample* by @vitiral.

WeightedChoice is a distribution for sampling one item from a slice, with weights. The current API only accepts a slice (reference) and clones objects when sampling.

The seq module has functions to sample multiple distinct elements from a slice or an iterator.

Issues to solve:

- Should WeightedChoice and seq::* stay in completely separate modules, the first as a distribution and the second as simple functions?
- Is there a place for all of this in rand?

References:

- seq::*
- sample_indicies
- WeightedChoice is hard to use with borrow checker
- RandomChoice impl
- Rng::pick / Choose
- WeightedChoice owning, adds Choose, some reorganisation

I'm going to say nothing is fixed at this point. In theory it would be nice not to break the new seq code.