Closed pitdicker closed 6 years ago
I fixed the spelling in the comments, thank you.
About how or where to add a constructor: there are so many designs and questions in the RFC thread, I don't know where to start. And I now almost nothing about API design... Probably best to leave some comments on the RFC tread first
I'd rather address other stuff (the traits and crate split) on the RFC thread first; there's too much going on there at the moment.
Using inclusive ranges seems sensible to me but would be a breaking change, so maybe a second constructor/function.
Sorry for derailing the discussion on the RFC tread, I saw your comment to late.
I sure don't think changing the default from open to closed ranges is a good idea! It would be very easy to silently break the code of others. Even with good documentation.
I had two idea's that should help make the range code a bit faster in some cases:
RangeInt
struct, thanks to a trick: the bits of the u32 that get truncated, happen to all be 1's.I'll report back when I have something.
I spend a few more hours trying to figure out why to code runs slower than it should. Without much success...
It turns out the modulus operation is slow, but not the real problem. Apparently rust adds a check before dividing by 0, and possibly panics. This slowed down the function by about 5%. If I leave out the loop from sample
, it is 40% faster. And that while the loop almost always (99+%) runs just once.
The idea to pick a larger zone for u8 and u16 worked out nicely. But it got a little complicated with the rules for casting. The idea of providing two different techniques produced some nice and complicated code, but was not any faster. Benchmarks: Before:
test distr_range2_i16 ... bench: 2,513 ns/iter (+/- 51) = 397 MB/s
test distr_range2_i8 ... bench: 2,941 ns/iter (+/- 30) = 340 MB/s
test distr_range2_int ... bench: 3,104 ns/iter (+/- 37) = 2577 MB/s
test distr_range_int ... bench: 3,102 ns/iter (+/- 29) = 2578 MB/s
After:
test distr_range2_i16 ... bench: 1,238 ns/iter (+/- 24) = 807 MB/s
test distr_range2_i8 ... bench: 1,231 ns/iter (+/- 34) = 812 MB/s
test distr_range2_int ... bench: 2,961 ns/iter (+/- 55) = 2701 MB/s
test distr_range_int ... bench: 3,120 ns/iter (+/- 43) = 2564 MB/s
I have added a function new_inclusive
to RangeImpl
, but not exposed the closed range methods further.
Rebased. You found the strange performance difference I have been searching for for hours: the compiler became to smart with the benchmarks :-). Now it all looks a lot less nice...
Before:
test distr_range2_i8 ... bench: 5,706 ns/iter (+/- 52) = 175 MB/s
test distr_range2_i16 ... bench: 4,699 ns/iter (+/- 33) = 425 MB/s
test distr_range2_i32 ... bench: 5,325 ns/iter (+/- 46) = 751 MB/s
test distr_range2_i64 ... bench: 11,080 ns/iter (+/- 26) = 722 MB/s
After:
test distr_range2_i8 ... bench: 4,863 ns/iter (+/- 38) = 205 MB/s
test distr_range2_i16 ... bench: 4,861 ns/iter (+/- 38) = 411 MB/s
test distr_range2_i32 ... bench: 5,426 ns/iter (+/- 50) = 737 MB/s
test distr_range2_i64 ... bench: 10,925 ns/iter (+/- 24) = 732 MB/s
Some improvements, some losses. But it all depends very much on the size of the range, at least for i32 and i65.
It still think this pr is useful, as it adds support for closed ranges (e.g. handling ranges that can cover the entire range of the type). And the optimisation of small integers and extra comments.
Yes, I had my head scratching why moving benchmarks from one module to another made a big difference, until I realised one used blackbox
. Micro-benchmarks are tricky.
Can you add a Range::new_inclusive
constructor?
Finally finished this. Sorry for taking so long.
Would it be okay if I make a PR that removes Range
and replaces it with Range2
?
No problem.
Yes, I was planning on removing the original range
; actually this PR is the reason I didn't yet. Go ahead.
Thank you. Removed the original range
.
Hi @dhardy. Is there a chance that you could merge this and the other PRs, or that we can move them along?
Yeah, I guess. Sorry, I've been busy with some other work and travel the last couple of weeks, should have more time now.
Thank you!
This does not compile, but I wanted to get your opinion.
The last few days I tried just about every trick to make sampling from a range of integers faster. Like making the zone a power of 2 (so mod becomes &), and multiplication with mersenne numbers. In the end, you are just trading the modulus for more times to call the prng (18~25% more). Because prng's can be much slower than the Xorshift used in the benchmarks, this is not a promising route to take...
It also took me a lot of time to understand your method to calculate the zone. So I tried to come up with something myself, and it is exactly the same. I guess that proves there are no more off-by-one errors :-).
What I could do was add some extra comments to explain what is happening here. Also the modulus could be optimised a very little bit thanks to this trick: skipping it for the small chance
v
falls in the target range, making it a few percent faster.What I would like to know your opinion about: is it useful to expose an inclusive range? As per this comment. It fit a little better with my mental model when writing the code.