dhardy / rand

http://doc.rust-lang.org/rand
Other
2 stars 2 forks source link

Allow sampling from a closed integer range #2

Closed pitdicker closed 6 years ago

pitdicker commented 7 years ago

This does not compile, but I wanted to get your opinion.

The last few days I tried just about every trick to make sampling from a range of integers faster. Like making the zone a power of 2 (so mod becomes &), and multiplication with mersenne numbers. In the end, you are just trading the modulus for more times to call the prng (18~25% more). Because prng's can be much slower than the Xorshift used in the benchmarks, this is not a promising route to take...

It also took me a lot of time to understand your method to calculate the zone. So I tried to come up with something myself, and it is exactly the same. I guess that proves there are no more off-by-one errors :-).

What I could do was add some extra comments to explain what is happening here. Also the modulus could be optimised a very little bit thanks to this trick: skipping it for the small chance v falls in the target range, making it a few percent faster.

What I would like to know your opinion about: is it useful to expose an inclusive range? As per this comment. It fit a little better with my mental model when writing the code.

pitdicker commented 7 years ago

I fixed the spelling in the comments, thank you.

About how or where to add a constructor: there are so many designs and questions in the RFC thread, I don't know where to start. And I now almost nothing about API design... Probably best to leave some comments on the RFC tread first

dhardy commented 7 years ago

I'd rather address other stuff (the traits and crate split) on the RFC thread first; there's too much going on there at the moment.

Using inclusive ranges seems sensible to me but would be a breaking change, so maybe a second constructor/function.

pitdicker commented 7 years ago

Sorry for derailing the discussion on the RFC tread, I saw your comment to late.

I sure don't think changing the default from open to closed ranges is a good idea! It would be very easy to silently break the code of others. Even with good documentation.

pitdicker commented 7 years ago

I had two idea's that should help make the range code a bit faster in some cases:

I'll report back when I have something.

pitdicker commented 7 years ago

I spend a few more hours trying to figure out why to code runs slower than it should. Without much success...

It turns out the modulus operation is slow, but not the real problem. Apparently rust adds a check before dividing by 0, and possibly panics. This slowed down the function by about 5%. If I leave out the loop from sample, it is 40% faster. And that while the loop almost always (99+%) runs just once.

The idea to pick a larger zone for u8 and u16 worked out nicely. But it got a little complicated with the rules for casting. The idea of providing two different techniques produced some nice and complicated code, but was not any faster. Benchmarks: Before:

test distr_range2_i16        ... bench:       2,513 ns/iter (+/- 51) = 397 MB/s
test distr_range2_i8         ... bench:       2,941 ns/iter (+/- 30) = 340 MB/s
test distr_range2_int        ... bench:       3,104 ns/iter (+/- 37) = 2577 MB/s
test distr_range_int         ... bench:       3,102 ns/iter (+/- 29) = 2578 MB/s

After:

test distr_range2_i16        ... bench:       1,238 ns/iter (+/- 24) = 807 MB/s
test distr_range2_i8         ... bench:       1,231 ns/iter (+/- 34) = 812 MB/s
test distr_range2_int        ... bench:       2,961 ns/iter (+/- 55) = 2701 MB/s
test distr_range_int         ... bench:       3,120 ns/iter (+/- 43) = 2564 MB/s

I have added a function new_inclusive to RangeImpl, but not exposed the closed range methods further.

pitdicker commented 7 years ago

Rebased. You found the strange performance difference I have been searching for for hours: the compiler became to smart with the benchmarks :-). Now it all looks a lot less nice...

Before:

test distr_range2_i8         ... bench:       5,706 ns/iter (+/- 52) = 175 MB/s
test distr_range2_i16        ... bench:       4,699 ns/iter (+/- 33) = 425 MB/s
test distr_range2_i32        ... bench:       5,325 ns/iter (+/- 46) = 751 MB/s
test distr_range2_i64        ... bench:      11,080 ns/iter (+/- 26) = 722 MB/s

After:

test distr_range2_i8         ... bench:       4,863 ns/iter (+/- 38) = 205 MB/s
test distr_range2_i16        ... bench:       4,861 ns/iter (+/- 38) = 411 MB/s
test distr_range2_i32        ... bench:       5,426 ns/iter (+/- 50) = 737 MB/s
test distr_range2_i64        ... bench:      10,925 ns/iter (+/- 24) = 732 MB/s

Some improvements, some losses. But it all depends very much on the size of the range, at least for i32 and i65.

It still think this pr is useful, as it adds support for closed ranges (e.g. handling ranges that can cover the entire range of the type). And the optimisation of small integers and extra comments.

dhardy commented 7 years ago

Yes, I had my head scratching why moving benchmarks from one module to another made a big difference, until I realised one used blackbox. Micro-benchmarks are tricky.

Can you add a Range::new_inclusive constructor?

pitdicker commented 6 years ago

Finally finished this. Sorry for taking so long.

Would it be okay if I make a PR that removes Range and replaces it with Range2?

dhardy commented 6 years ago

No problem.

Yes, I was planning on removing the original range; actually this PR is the reason I didn't yet. Go ahead.

pitdicker commented 6 years ago

Thank you. Removed the original range.

pitdicker commented 6 years ago

Hi @dhardy. Is there a chance that you could merge this and the other PRs, or that we can move them along?

dhardy commented 6 years ago

Yeah, I guess. Sorry, I've been busy with some other work and travel the last couple of weeks, should have more time now.

pitdicker commented 6 years ago

Thank you!