dhardy / rand

http://doc.rust-lang.org/rand

Add sample_single to Range #69

Closed pitdicker closed 6 years ago

pitdicker commented 6 years ago

As discussed in https://github.com/dhardy/rand/pull/68.

Benchmarks before (with i128_support):

test gen_range_i8             ... bench:       6,912 ns/iter (+/- 6) = 144 MB/s
test gen_range_i16            ... bench:       6,491 ns/iter (+/- 11) = 308 MB/s
test gen_range_i32            ... bench:       7,274 ns/iter (+/- 48) = 549 MB/s
test gen_range_i64            ... bench:      11,476 ns/iter (+/- 36) = 697 MB/s
test gen_range_i128           ... bench:     279,477 ns/iter (+/- 552) = 57 MB/s

After:

test gen_range_i8             ... bench:       5,350 ns/iter (+/- 5) = 186 MB/s
test gen_range_i16            ... bench:       5,347 ns/iter (+/- 3) = 374 MB/s
test gen_range_i32            ... bench:       3,922 ns/iter (+/- 22) = 1019 MB/s
test gen_range_i64            ... bench:       8,228 ns/iter (+/- 35) = 972 MB/s
test gen_range_i128           ... bench:     148,177 ns/iter (+/- 322) = 107 MB/s

And the normal code with a one-time set-up cost for comparison:

test distr_range_i8           ... bench:       2,685 ns/iter (+/- 27) = 372 MB/s
test distr_range_i16          ... bench:       2,866 ns/iter (+/- 29) = 697 MB/s
test distr_range_i32          ... bench:       3,361 ns/iter (+/- 53) = 1190 MB/s
test distr_range_i64          ... bench:       3,321 ns/iter (+/- 64) = 2408 MB/s
test distr_range_i128         ... bench:     140,754 ns/iter (+/- 1,327) = 113 MB/s

The performance of sample_single depends quite a bit on how close the range is to a power of two, but I have not investigated the details... The performance gain seems worth the extra method.
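For context, a minimal sketch of the two call styles these benchmarks compare, written against the published rand 0.8 API (names in this fork, and in later rand versions, may differ):

use rand::distributions::{Distribution, Uniform};
use rand::{thread_rng, Rng};

fn main() {
    let mut rng = thread_rng();

    // distr_range_*: pay the set-up cost once, then sample repeatedly.
    let range = Uniform::new(1i32, 100);
    let _a = range.sample(&mut rng);

    // gen_range_*: set up and sample in a single call; this one-shot path
    // is what sample_single is meant to speed up.
    let _b: i32 = rng.gen_range(1..100);
}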

I am not sure the order of arguments is the best choice, with (low, high, rng). Would (rng, low, high) be better? Also I tried to make sample_single have a default implementation, like in RangeFloat, but couldn't get it to work.

pitdicker commented 6 years ago

I am seeing some improvements in other benchmarks too:

Before:

test distr_uniform_ascii_char ... bench:      12,196 ns/iter (+/- 222) = 327 MB/s

test misc_sample_indices_100_of_1k   ... bench:       1,402 ns/iter (+/- 1)
test misc_sample_indices_10_of_1k    ... bench:         528 ns/iter (+/- 0)
test misc_sample_indices_50_of_1k    ... bench:         780 ns/iter (+/- 1)
test misc_sample_iter_10_of_100      ... bench:       1,266 ns/iter (+/- 3)
test misc_sample_slice_10_of_100     ... bench:         182 ns/iter (+/- 0)
test misc_sample_slice_ref_10_of_100 ... bench:         180 ns/iter (+/- 0)

After:

test distr_uniform_ascii_char ... bench:       3,713 ns/iter (+/- 4) = 1077 MB/s

test misc_sample_indices_100_of_1k   ... bench:         669 ns/iter (+/- 1)
test misc_sample_indices_10_of_1k    ... bench:         437 ns/iter (+/- 1)
test misc_sample_indices_50_of_1k    ... bench:         399 ns/iter (+/- 0)
test misc_sample_iter_10_of_100      ... bench:         946 ns/iter (+/- 7)
test misc_sample_slice_10_of_100     ... bench:         142 ns/iter (+/- 0)
test misc_sample_slice_ref_10_of_100 ... bench:         140 ns/iter (+/- 0)
pitdicker commented 6 years ago

Just added a little optimisation to ascii_word_char. Its range is close enough to a power of two that a bitmask is more efficient.

test distr_uniform_ascii_char ... bench:       2,754 ns/iter (+/- 11) = 1452 MB/s
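For illustration, a minimal sketch of the bitmask idea (the character set and function here are an approximation, not the exact code in this commit): the word-character set has 63 entries, so masking a random value down to 6 bits and rejecting the single out-of-range result avoids a general range reduction.

use rand::Rng;

// Approximate sketch of the bitmask approach, not the exact implementation.
// 26 + 26 + 10 + 1 = 63 characters, just under the power of two 64.
const WORD_CHARS: &[u8; 63] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_";

fn ascii_word_char<R: Rng + ?Sized>(rng: &mut R) -> char {
    loop {
        // Keep only the low 6 bits, giving values 0..=63.
        let idx = (rng.next_u32() & 0x3f) as usize;
        if let Some(&b) = WORD_CHARS.get(idx) {
            return b as char; // 0..=62: accept
        }
        // idx == 63 is outside the 63-entry table: reject and retry so the
        // distribution over the characters stays uniform.
    }
}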

With choose I see a possible optimisation, but I don't know if it is acceptable. It needs a usize in some range to use as a slice index; on x86_64 this is the same size as a u64, which is twice as slow to generate as a u32. It would help if we could assume the slice length is less than 2^32. Would it be ok to document: "Choose will only pick from the first 2^32 values of a slice"?

dhardy commented 6 years ago

Hmm, not sure how to answer this. I'm not so happy with the use of usize for container sizes everywhere since for the vast majority of uses u32 is sufficient, but it's what we have, and it seems odd doing something different here.

pitdicker commented 6 years ago

With i128_support on x86_64 a range reduction on u64 should be about as fast as on u32.

For cryptographic RNGs, generating u64s instead of u32s should be about 50 to 60 percent slower. It is not 100% slower because of (or rather thanks to) the overhead of reading the results from a buffer, at least with an improvement I made for HC-128 that should also work for ISAAC and ChaCha.

Small, fast RNGs mostly need to operate on 64-bit words for good statistical quality. There are some really fast ones where, on x86_64, generating u64s has the same excellent performance as generating u32s.

So on second thought, using usize as the index on x86_64 should eventually have performance similar to using a u32. We should just leave choose as it is, without subtleties.

pitdicker commented 6 years ago

Now I wonder if the optimization of ascii_word_char is worth it. It will only be about 10% faster than the generic path once we have better RNGs and i128_support.

I do like that it removes a complex dependency between Uniform, sequences, Sample and Range. Before, the chain looked like: Uniform::ascii_word_char → sequences::Choose → Sample::gen_range → Range::sample_single. I think it makes some sense to end up with distributions that only depend on rand_core (and std/core). Edit: and utils::FloatConversions.

dhardy commented 6 years ago

Yes, the optimisation/standalone impl for ascii_word_char is a good idea in my opinion.

Not sure what you mean here?

Also I tried to make sample_single have a default implementation, like in RangeFloat, but couldn't get it to work.

pitdicker commented 6 years ago

It would be nice to have something like this as a default implementation:

fn sample_single<R: Rng+?Sized>(low: Self::X, high: Self::X, rng: &mut R) -> Self::X {
    let range: Self = RangeImpl::new(low, high);
    range.sample(rng)
}

Then we would not have to add this code when there is no faster / specialized method available for some type (like the one in the tests of this module). But I could not get it to work because Rust can't figure out the types (and neither could I...).

dhardy commented 6 years ago

Ah. It's actually quite simple:

error[E0277]: the trait bound `Self: core::marker::Sized` is not satisfied                                                             
   --> src/distributions/range.rs:107:13                                                                                               
    |                                                                                                                                  
107 |         let range: Self = RangeImpl::new(low, high);                                                                             
    |             ^^^^^ `Self` does not have a constant size known at compile-time

What the compiler is saying is that you can only construct sized types; here you are trying to construct range, whose type implements RangeImpl, but the compiler doesn't know that the RangeImpl trait can only be implemented for Sized types (e.g. you could try to impl RangeImpl for [u8]). You can fix this by specifying trait RangeImpl: Sized { ... }; this restricts implementations to sized types.
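For reference, a reduced sketch of what the trait looks like with that bound added (only the pieces discussed here; the real RangeImpl has more methods and different details):

use rand::Rng;

// Reduced sketch: RangeImpl with a Sized supertrait so that the default
// sample_single can construct Self. Not the crate's exact definition.
pub trait RangeImpl: Sized {
    type X;

    fn new(low: Self::X, high: Self::X) -> Self;
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Self::X;

    // Default implementation: build a one-shot range and sample from it,
    // which is what the snippet above was trying to do.
    fn sample_single<R: Rng + ?Sized>(low: Self::X, high: Self::X, rng: &mut R) -> Self::X {
        let range = Self::new(low, high);
        range.sample(rng)
    }
}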

pitdicker commented 6 years ago

I can't really imagine a RangeImpl for slices, so a Sized bound should not really be a restriction. Thank you; 20 more errors like this and I will start to get a feel for traits :smile:. Will add a commit tomorrow.

pitdicker commented 6 years ago

I think this is ready now

dhardy commented 6 years ago

Great, thanks!

I'm going to assume that the sampling is correct; I glanced over the code, but there are so many routines just for ranges now, and then there are all the other distributions. I'm thinking an external tool to plot 10,000+ samples for visual inspection might be useful, plus some "synthetic tests" (checking that a few inputs produce the expected values).

pitdicker commented 6 years ago

Now that you mention it, I did not add any tests... The tests for choose check that some values give plausible results and that the values are in range.

dhardy commented 6 years ago

No, properly testing distributions would appear to be much more difficult than writing the distribution code.

pitdicker commented 6 years ago

Yes, I was thinking about that after your comment. A simple test could be to generate a lot of numbers, fill something like 256 buckets, and check whether they roughly follow the distribution. But that would only catch the worst errors. Shall I open an issue?
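For concreteness, the bucket idea might look something like this rough sketch (written against the rand 0.8 API; the bucket count, sample count, and tolerance are arbitrary illustration choices):

use rand::{thread_rng, Rng};

// Rough sketch of a bucket test: draw many values from a range, count how
// many fall into each of 256 equal-width buckets, and flag any count that is
// far from the expected value. This catches gross errors, not subtle bias.
fn main() {
    const BUCKETS: usize = 256;
    const SAMPLES: usize = 1_000_000;

    let mut rng = thread_rng();
    let mut counts = [0u32; BUCKETS];
    for _ in 0..SAMPLES {
        // 256 buckets of width 16 over the range 0..4096.
        let x = rng.gen_range(0..4096usize);
        counts[x / 16] += 1;
    }

    let expected = (SAMPLES / BUCKETS) as f64;
    for (i, &count) in counts.iter().enumerate() {
        // A generous 10% tolerance (several standard deviations here); a real
        // test would use a proper chi-squared bound.
        let deviation = (count as f64 - expected).abs();
        assert!(deviation < expected * 0.1, "bucket {}: {} vs expected {}", i, count, expected);
    }
}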