Closed pitdicker closed 6 years ago
I am seeing some improvements in other benchmarks too:
Before:
test distr_uniform_ascii_char ... bench: 12,196 ns/iter (+/- 222) = 327 MB/s
test misc_sample_indices_100_of_1k ... bench: 1,402 ns/iter (+/- 1)
test misc_sample_indices_10_of_1k ... bench: 528 ns/iter (+/- 0)
test misc_sample_indices_50_of_1k ... bench: 780 ns/iter (+/- 1)
test misc_sample_iter_10_of_100 ... bench: 1,266 ns/iter (+/- 3)
test misc_sample_slice_10_of_100 ... bench: 182 ns/iter (+/- 0)
test misc_sample_slice_ref_10_of_100 ... bench: 180 ns/iter (+/- 0)
After:
test distr_uniform_ascii_char ... bench: 3,713 ns/iter (+/- 4) = 1077 MB/s
test misc_sample_indices_100_of_1k ... bench: 669 ns/iter (+/- 1)
test misc_sample_indices_10_of_1k ... bench: 437 ns/iter (+/- 1)
test misc_sample_indices_50_of_1k ... bench: 399 ns/iter (+/- 0)
test misc_sample_iter_10_of_100 ... bench: 946 ns/iter (+/- 7)
test misc_sample_slice_10_of_100 ... bench: 142 ns/iter (+/- 0)
test misc_sample_slice_ref_10_of_100 ... bench: 140 ns/iter (+/- 0)
Just added a little optimisation to ascii_word_char
. It's range is close enough to a power of two that a bitmask is more efficient.
test distr_uniform_ascii_char ... bench: 2,754 ns/iter (+/- 11) = 1452 MB/s
With choose
I see a optimisation, but don't know if it is possible. It needs an usize
in some range to use as slice index. on x86_64 this is the same as an u64
, which is twice as slow as u32
. It would help if we could assume the slice length < 2^32. Would it be ok to document: "Choose
will only pick from the first 2^32 values of a slice"?
Hmm, not sure how to answer this. I'm not so happy with the use of usize
for container sizes everywhere since for the vast majority of uses u32
is sufficient, but it's what we have, and it seems odd doing something different here.
With i128_support
on x86_64 a range reduction on u64
should be about as fast as on u32
.
For cryptographic RNG's generating u64
's instead of u32
's should be about 50~60 percent slower. It is not 100% because of (thanks to) the overhead of reading from the buffered the results. At least with an improvement I made for HC-128, that should also work for ISAAC and ChaCha.
Small fast RNG's mostly need to operate on 64-bit words for good statistical quality. There are some really fast ones where on x86_64 generating u64
's has the same excellent performance as generating u32
's.
So on second thought using usize
on x86_64 should eventually have similar performance as using an u32
as index. We should just leave choose
as it is, without subtleties.
Now I wonder if the optimization of ascii_word_char
is worth it. It is only about 10% faster then when we have better RNG's and i128_support
.
I do like that it removes a complex dependency between Uniform
, sequences
, Sample
and Range
. Before the chain looked like: Uniform::ascii_word_char
→ sequences::Choose
→ Sample::gen_range
→ Range::sample::single
. I think it makes some sense to end up with distributions that only depend on rand_core
(and std
/core
). Edit: and utils::FloatConversions
.
Yes, the optimisation/standalone impl for ascii_word_char
is a good idea in my opinion.
Not sure what you mean here?
Also I tried to make sample_single have a default implementation, like in RangeFloat, but couldn't get it to work.
It would be nice to have something like this as a default implementation:
fn sample_single<R: Rng+?Sized>(low: Self::X, high: Self::X, rng: &mut R) -> Self::X {
let range: Self = RangeImpl::new(low, high);
range.sample(rng)
}
Then we would not have to add this code when there is no faster / specialized method available for some type (like the one in the tests of this module). But I could not get it to work because Rust can't figure out the types (and neither could I...).
Ah. It's actually quite simple:
error[E0277]: the trait bound `Self: core::marker::Sized` is not satisfied
--> src/distributions/range.rs:107:13
|
107 | let range: Self = RangeImpl::new(low, high);
| ^^^^^ `Self` does not have a constant size known at compile-time
What the compiler is saying is that you can only construct sized types; here you are trying to construct range
of type RangeImpl
, but it doesn't know that the RangeImpl
trait can only be implemented for Sized
types (e.g. you could try to impl RangeImpl for [u8]
). You can fix this by specifying trait RangeImpl: Sized { ... }
; this restricts implementations to sized types.
I can't really imagine a RangeImpl
for slices, so a Sized
bound should not really be a restriction. Thank you, 20 more errors and I will start to get a feeling for traits :smile:. Will add a commit tomorrow.
I think this is ready now
Great, thanks!
I'm going to assume that the sampling is correct; I glanced over the code but there's so many routines just for ranges now; then there are all the other distributions. I'm thinking an external tool to plot 10'000+ samples for visual inspection might be useful, plus some "synthetic tests" (checking a few inputs produce the expected values).
Now that you mention it, I did not add any tests... The tests for choose
cover testing if some values give plausible results, and if the values are in range.
No, properly testing distributions would appear to be much more difficult than writing the distribution code.
Yes, I was thinking about that after your comment. A simple test could be to generate a lot of numbers, fill something like 256 buckets, and compare if they roughly follow the distribution. But that would only catch the worst errors. Shall I open an issue?
As discussed in https://github.com/dhardy/rand/pull/68.
Benchmarks before (with
i128_support
):After:
And the normal code with a one-time set-up cost for comparison:
The performance of
sample_single
depends quite a bit on how close the range is to a power of two, but I have not investigated the details... Performance seems worth the extra method.I am not sure the order of arguments is the best choice, with
(low, high, rng)
. Would(rng, low, high)
be better? Also I tried to makesample_single
have a default implementation, like inRangeFloat
, but couldn't get it to work.