dhardy / rand

http://doc.rust-lang.org/rand
Other
2 stars 2 forks source link

Distributions: Uniform, Uniform01, Open01, etc. #81

Closed dhardy closed 6 years ago

dhardy commented 6 years ago

Update: we will likely use these distributions:

Original post follows.


I'm looking at bringing over some of the changes to distributions code to upstream, in particular replacing Sample and IndependentSample with Distribution, but the change-set touches on a few other things (adjustments to things like Exp1 are needed to support dynamic dispatch (Rng+?Sized) in Distribution::sample, since Rand doesn't support this and changing that would break all Rand implementations everywhere).

So, question: is everyone happy with the distribution names we converged on in this branch, that is

Wacky ideas:

Of course we don't need any changes. Just asking since it didn't always appear people were happy with these names.

pitdicker commented 6 years ago

I'm looking at bringing over some of the changes to distributions code to upstream

:tada:. Also, thanks for opening this issue for a fresh look and discussion. Anyone we should cc?

In https://github.com/rust-lang/rfcs/pull/2106#discussion_r136091362 I suggested that Basic may be a good alternative to Default, so it does not clash with std::default::Default.

Dropping Uniform in favor of Default (or Basic) seems like a nice simplification for integers. FullRange is a nice find as alternative, it fits with the Range distribution. But I don't think code using this is thinking in terms of ranges, and it may feel less natural there.

For the uniform distributions of floats we now have:

I don't think the names are perfect. Uniform01 and the alternative HalfOpen01 both have disadvantages. How do you know which side is open? [0, 1) vs (0, 1].

For the Open01 distribution I can see real uses. In some situations you never want to generate 0.0, for example when the value is to be used for division or log. It uses simple rejection sampling on top of Uniform01.

Personally I am not sure if we need both the Uniform01 and Closed01 distribution. We can guarantee that every value has an uniform chance of occurring. But the chance of generating 1.0 is one in 2^53 for f64 (and 1 in 2^22 for f32). Does that really merit two different distributions?

There is an other not-all-that-meaningful difference between Uniform01 and Closed01: all numbers that straddle an exponent boundary have a tiny chance of occurring to often. 0.5 will occur every 1 in 2^23 results too often, 0.25 every 1 in 2^24, 0.125 every 2^25 (for f32, and I wrote this quickly, exponents could be off by 1). The problem is that there is just no better distribution possible for a closed-open range. Closed01 has a distribution as perfect as technically possible, but it also is slightly slower.

I think renaming Uniform01 to Default (or Basic) is a good solution, leaving only Open01 and Closed01 as slower but sometimes useful alternatives for floats.

pitdicker commented 6 years ago

(adjustments to things like Exp1 are needed to support dynamic dispatch (Rng+?Sized)

Removing the Sized constraint is a desired change, but you probably want to wait with it so all the changes that effect RNG's are bundled in one release?

I don't really understand the consequences around dynamic dispatch. I should just try it and see what brakes, but would you be willing to explain it?

pitdicker commented 6 years ago

In https://github.com/dhardy/rand/issues/72#issuecomment-353937015 I made a list of changes / extra tests we may want to add to the distributions, besides something like an external test suite. There is not really anything in there that should happen before the merge into upstream though.

Something I would really like to see is to have the floating point range implementation be exact, and not sometimes crossing the boundaries. Now it is a bit strange that we can say: the precision of the results of Range has improved much, but it is still imprecise around the boundaries. I still believe it must be simple, but haven't found the trick yet.

dhardy commented 6 years ago

Dynamic dispatch: lots of OO languages use a vtable at the beginning of an object, for example an Java object of theHashMap class is in fact a fat object with a pointer to the HashMap virtual table as well as the internal map fields. That way if you cast the object to an AbstractMap then call size(), the run-time finds HashMap::size. Rust works a bit differently. There isn't a Map trait because it requires Higher Kinded Types, but for example String implements Write. This String object does not have a vtable, nor does a &mut String reference to it, but if you want to cast that to &mut Write, the CPU needs to know how to access the write_str function implementation for String. The way that works is that a &mut String is a fat pointer, containing one pointer to the vtable and another to the String object. So what's ?Sized got to do with it? &mut Write is by default Sized, implying the compiler knows the size of the pointed object, which is only possible if the compiler knows (at compile time) the type of the underlying object. If it doesn't, you need &mut Write + ?Sized.

dhardy commented 6 years ago

Removing the Sized constraint is a desired change, but you probably want to wait with it so all the changes that effect RNG's are bundled in one release?

The last few commits re-implement some types as distributions to let me switch them. This still needs Uniform01 or equivalent however.

Why do it now? Changing Distribution::sample from sample<R: Rng> to sample<R: Rng + ?Sized> is a breaking change: all implementations must be updated. So I'd rather get it right first time.

dhardy commented 6 years ago

In #72 (comment) I made a list of changes ...

I'm not looking at changing the algorithms now, just the infrastructure. I might let you make the algorithms PRs ;-)

Hmm, rand::distributions::Default or rand::distributions::Basic... I still prefer the former, but not certain. Another option might be Standard, but that name is probably even more confusing than Default; obviously Normal is inappropriate (thanks Gauss).

Removing Uniform and Uniform01 may make sense, though they don't cost us much (aside from confusing users).

dhardy commented 6 years ago

@pitdicker maybe I can annoy you even more by putting implementations in src/distributions/float.rs and src/distributions/integer.rs. Then I think we also do not need src/utils.rs. Okay?

At least, the compiler doesn't complain, though my text editor highlights both Default and float incorrectly. Not important; the module names will probably never be exposed to users anyway.

pitdicker commented 6 years ago

Thanks for the explanation around ?Sized. I mostly get the idea behind how it works, but have some trouble yet so see all the consequences for the API... Give me some time and otherwise I will trust you on it.

I might let you make the algorithms PRs ;-)

Ok, it is high time I start working on upstream šŸ˜„.

maybe I can annoy you even more by putting implementations in src/distributions/float.rs and src/distributions/integer.rs. Then I think we also do not need src/utils.rs. Okay?

Just two weeks ago I was thinking something like that would be nice, but more with the range code in mind. Could the RangeImp parts also be part of the new float.rs and integer.rs files? And what is left be moved to src/distributions/mod.rs?

dhardy commented 6 years ago

I hadn't thought about moving the range code, but I guess that could work, if there's not too much to it.

The distributions top-level module currently has Distribution, some Rand bits, some "weighted" distribution code I don't know what to do with, and the ziggurat function. Maybe we should try moving the last two things out if we're adding Default/Basic and Range.

dhardy commented 6 years ago

I believe we settled this, though we now use Standard as the "default" distribution.