tspiteri opened this issue 6 years ago
I don't think number 2 captures how the functions in `FloatConversions` work (without going into all the details). The functions with higher precision take the normal number of bits for the mantissa, and the remaining bits are used to fill the exponent, by taking something similar to their logarithm. For the lowest exponent there is a small correction, because we can't generate all possible exponents a float supports.
So there is no rounding, no infinitely long stream, and only one number from the RNG is used.
The smallest delta between values is 1/2^32 (for `f32`), and the largest delta is 1/2^24 (ε/2).
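To make this concrete, here is a minimal sketch of the idea for `f32`, consuming a single `u32` (my own illustration with made-up names, not the actual `FloatConversions` code):

fn high_precision_01(x: u32) -> f32 {
    let fraction = x & 0x7F_FFFF; // low 23 bits become the mantissa
    let rest = x >> 23;           // the other 9 bits select the exponent
    if rest != 0 {
        // exponent -1 with probability 1/2, -2 with 1/4, ..., -9 with 1/512
        let exp = -1 - rest.trailing_zeros() as i32;
        f32::from_bits(((127 + exp) as u32) << 23 | fraction)
    } else {
        // correction for the lowest exponent: fill [0, 2^-9) with step 2^-32
        fraction as f32 / 4_294_967_296.0 // fraction · 2^-32
    }
}

This matches the deltas above: the step is ε/2 = 1/2^24 in [0.5, 1) and 1/2^32 next to zero.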
Note that we decided against keeping `next_f32` and `next_f64` in the `Rng` trait; the distributions will be the only interface to get random floating point numbers.
I forgot to say: thank you for studying this :smile:.
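For context, this is roughly what the distribution-only interface looks like today (assuming rand 0.8; the distribution names changed a few times over the versions discussed in this thread):

use rand::distributions::{Open01, OpenClosed01};
use rand::Rng;

fn main() {
    let mut rng = rand::thread_rng();
    let a: f64 = rng.gen();                // Standard distribution, [0, 1)
    let b: f32 = rng.sample(Open01);       // (0, 1)
    let c: f32 = rng.sample(OpenClosed01); // (0, 1]
    println!("{a} {b} {c}");
}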
There was already an issue open about how and what to merge back into rand upstream: https://github.com/dhardy/rand/issues/81.
Back when I added https://github.com/dhardy/rand/pull/1 there surprisingly seemed to be no performance difference between the normal methods and the ones with extra precision. But there was a problem with the benchmarks, and I had a lot to learn (better now, but still true :smile:). Turns out they are slower, as expected.
Now I think the fixed precision variants are a better default. I agree `Closed01` is not really worth keeping then. It doesn't do anything special; it could just as well be replaced with a range. The `Open01` distribution is useful for all algorithms that don't work with `0.0`, so we should keep that.
For the higher precision variants it is another matter, because they need different strategies for `Open01` and `Closed01`. Something to think about some more. I don't like the number of different functions we are getting this way...
(Assuming `f32`.) `FloatConversions::closed_open01` generates 23 bits for the fraction, so ignoring the exponent the random number is any of the 2^23 possibilities from 1.0 to 2.0 − 2^−23. It also does some computation for the exponent. Basically it does this:
Method 2 is more general and, ignoring rounding, it is functionally equivalent except for the last case, for which it is a generalization. Note that the results would be different because of the implementation, but the outcomes would be equally probable except for the last case. Basically, for method 2 you get this, ignoring rounding:
* […] `u32`), exp = −1 (equivalent to first case above).
* […] `u32`), exp = −2 (equivalent to second case above).
* […] `u32`), exp = −9 (equivalent to penultimate case above).
* […] `u32`), exp = −10 (no equivalent above).
* […] `u32` values), exp = −126. This is still normal.

For what it's worth, I have some code doing this, and I just cleaned it up a bit. It looks like this:
use std::mem;
use rand::Rng; // rand 0.4-era trait, where next_u32 lives on Rng

struct BitBuffer {
    bits: u32,
    len: u32,
}

impl BitBuffer {
    fn new() -> BitBuffer {
        BitBuffer { bits: 0, len: 0 }
    }

    fn fill<R: Rng + ?Sized>(&mut self, rng: &mut R) {
        self.bits = Rng::next_u32(rng);
        self.len = 32;
    }

    // Gets n random bits.
    fn get_bits<R: Rng + ?Sized>(&mut self, mut n: u32, rng: &mut R) -> u32 {
        assert!(n <= 32);
        if n == 0 {
            return 0;
        }
        if self.len == 0 {
            self.fill(rng);
        }
        let mut ans = 0;
        if n > self.len {
            n -= self.len;
            ans |= self.bits << n;
            self.fill(rng);
        };
        self.len -= n;
        ans |= self.bits >> self.len;
        self.bits &= !(!0 << self.len);
        ans
    }

    // Finds the first significant bit, if available in the first max_bits bits.
    fn first_one<R: Rng + ?Sized>(&mut self, max_bits: u32, rng: &mut R) -> Option<u32> {
        let mut prev_leading = 0;
        loop {
            // count only leading zeros in the available bits
            let add_to_leading = self.bits.leading_zeros() + self.len - 32;
            let leading = prev_leading + add_to_leading;
            // check if we have no ones in the initial max_bits
            if leading >= max_bits {
                // skip consumed zeros
                self.len -= max_bits - prev_leading;
                return None;
            }
            // check if we have some significant bit
            if add_to_leading < self.len {
                // skip not just the zeros, but also the first significant bit
                self.len -= add_to_leading + 1;
                self.bits &= !(!0 << self.len);
                return Some(leading);
            }
            // we do not have enough bits yet
            prev_leading = leading;
            self.fill(rng);
        }
    }
}

// The random number generated is 0.*, where * is an infinitely long
// stream of random bits.
//
// If * starts with (BIAS - 1) zeros, the number is subnormal. For a
// subnormal number, we set the biased exponent to zero and the
// mantissa bits to the random bits after the leading (BIAS - 1)
// zeros.
//
// For a normal number, we set the biased exponent to (BIAS - 1 -
// leading zeros), then we skip the one bit that stopped the stream of
// leading zeros, which will be encoded implicitly. Then we set the
// mantissa bits to the random bits after that first significant bit.
//
// For both subnormal and normal numbers, we have to check rounding.
// We are rounding to the nearest. A tie is not possible since * is an
// infinitely long stream of random bits, so we just take one more
// random bit to decide whether to round down or up.
pub fn uniform01_f32<R: Rng + ?Sized>(rng: &mut R) -> f32 {
    const BIAS: u32 = 127;
    const MANT_BITS: u32 = 23;
    let mut bits = BitBuffer::new();
    let mut u = match bits.first_one(BIAS - 1, rng) {
        None => 0,
        Some(first) => (BIAS - 1 - first) << MANT_BITS,
    };
    u |= bits.get_bits(MANT_BITS, rng);
    // Handle rounding. For rounding up, just add 1 to u, which always
    // moves to the next floating-point number above.
    if bits.get_bits(1, rng) == 1 {
        u += 1;
    }
    unsafe { mem::transmute(u) }
}
Good explanation. That is mostly how the source of the idea was described in https://readings.owlfolio.org/2007/generating-pseudorandom-floating-point-values/ (without a practical way to implement it though).
If we are only thinking in terms of `Distributions` I think your method 2 can work, requesting more random numbers when necessary. It should not even be much, if any, slower. Yet I think a conversion function that just uses all the bits of one round of the RNG is nice; a function that can generate every possible floating point value in [0, 1) seems overkill. One reason I did not go that route initially is that for `f64` it could require up to 16 `u64`s to pick the exponent, although that is extremely unlikely.
Requiring 16 `u64` values is so unlikely the universe could well reach its heat death before that happens 😄. Unless the random number generator just produces zeros, but I think that would only be useful in testing.
Found something interesting: suppose you want to generate a random `f64` that can be any possible value in the range [0, 1). The exponent can go down to −1024, and the mantissa is 52 bits. In total 1076 bits are needed from the RNG.
The chance to get a certain value depends on how close it is to zero, with the smallest values having a chance of 1 in 2^1076. To be able to generate such a number the RNG must have a period of at least 2^1076.
Didn't see your post before commenting.
My rule of thumb is that today's hardware can generate about 2^30 values per second on a single processor. That means roughly 2^55 per year (a year is about 2^25 seconds), so 2^64 takes about 500 CPU-years. Any conversion method that just takes a single `u64` seems good enough to me.
Yeah, it is pretty unlikely to require two `u64` values. The method I posted above only becomes really significant for low precisions, e.g. with `half::f16`.
So, to summarise, we have:

* `u32` / `u64` […]

Separate to that, we have five questions:

* Are these details likely to matter? For example:
  * `fn bernoulli(p: f64) -> bool { rng.gen() < p }` — precision is going to be limited to roughly ε anyway; but `bernoulli(0)` should be impossible and `bernoulli(1)` guaranteed (it may make sense to use `<=` instead); also some people might expect `bernoulli(p)` to work well for small `p` (say, less than 2^-20).
  * `let index = (rng.gen() * len as f64) as usize` — may round to `len` anyway; people should definitely use `rng.gen_range(0, len)` instead!
  * `fn exp() -> f64 { -rng.gen().ln() }` — may generate +inf; in theory this is okay; for most uses the precision of very large results won't be very important (e.g. this distribution is used to sample a decay time given a half-life).

  For many cases, the exact details are not important. In some ways, allowing sampling of 1 but never 0 makes more sense as a default than the status quo, but I'm uncomfortable both making this breaking change and bucking the norm.
* Do we need `[0, 1)`, `(0, 1]`, `[0, 1]` and `(0, 1)`? No. Arguably `(0, 1]` alone should satisfy most use-cases. But I'm not sure which is the best sub-set.
* Not really sure at this point, but benchmarks of various optimised implementations would be nice.

Edit: added questions 4 and 5 and a few details.
Of the variant "generate values uniformly distributed between 0 and 1−ε with step ε (optionally offset)", the ranges `[0, 1)`, `(0, 1)` and `(0, 1]` can all be generated with exactly the same performance.
Step 1 is always to make a float in the range `[1, 2)`. Then subtract `1.0`, `1.0 - ε/2` or `1.0 - ε` respectively.
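A hedged sketch of that trick for `f32` (constants and names are mine, not the actual rand implementation):

fn three_ranges(bits: u32) -> (f32, f32, f32) {
    // step 1: fixed exponent for [1, 2), 23 random mantissa bits, step ε
    let one_two = f32::from_bits(127 << 23 | (bits >> 9));
    let eps = f32::EPSILON; // 2^-23
    (
        one_two - 1.0,               // [0, 1)
        one_two - (1.0 - eps / 2.0), // (0, 1)
        one_two - (1.0 - eps),       // (0, 1]
    )
}

All three subtractions are exact, which is why the three variants cost the same.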
Which implies it would be nice to move the `next_f32` impl out to a distribution/`Rand` implementation where all variants can be implemented uniformly, and drop `Closed01`.
Note: I had a glance at dSFMT and it looks like it's not a "native FP generator" anyway, but works similarly to some of the methods mentioned here, except that it only generates 52 random bits.
> Which implies it would be nice to move the `next_f32` impl out to a distribution/`Rand` implementation where all variants can be implemented uniformly, and drop `Closed01`.
👍 from me.
A benchmark (warning: tampered with 😄; I changed the names):
test gen_u32_xorshift ... bench: 1,371 ns/iter (+/- 4) = 2917 MB/s
test fixed_01_f32 ... bench: 1,599 ns/iter (+/- 2) = 2501 MB/s
test open_closed_01_f32 ... bench: 2,629 ns/iter (+/- 8) = 1521 MB/s
test open_open_01_f32 ... bench: 2,761 ns/iter (+/- 15) = 1448 MB/s
test closed_open_01_f32 ... bench: 2,521 ns/iter (+/- 14) = 1586 MB/s
Xorshift is included for reference; the time the conversion takes is the difference from the Xorshift line. `fixed` is for the three ranges that generate a float with ε precision. The other three are my higher precision attempts.
The three higher-precision variants are roughly 3× slower than the fixed precision variants. Generating `f32`s with a generator like Xorshift is 15% (fixed precision) vs. 50% (higher precision) slower than generating integers. Using an Xorshift variant that works on `u64`s to generate `f64`s should give similar results.
The large difference in performance is (was) a bit surprising to me. Besides `mov`s etc., `fixed` takes just 3 operations. `closed_open` takes 7 operations (some of which can be done in parallel) and 1 branch. But if I look at the assembly of `closed_open01`, it contains two more branches than expected. Not sure what is going on yet; it looks like `trailing_zeros` does not work optimally...
Hmm. Turns out the extra branch and jump are inserted because counting the number of trailing zeros of `0` is undefined (for the underlying instruction).
Manually setting a bit makes LLVM optimize out the branch (`2,505 ns/iter (+/- 13) = 1596 MB/s`):
#[inline(always)]
fn closed_open01(self) -> $ty {
    const MIN_EXPONENT: i32 = $FRACTION_BITS - $FLOAT_SIZE;
    #[allow(non_snake_case)]
    let ADJUST: $ty = (0 as $uty).binor_exp(MIN_EXPONENT); // 2^MIN_EXPONENT

    // Use the last 23 bits to fill the fraction
    let fraction = self & $FRACTION_MASK;
    let remaining = self >> $FRACTION_BITS;
    // Use the remaining (first) 9 bits for the exponent
    let exp = -((remaining | (1 << $FRACTION_BITS))
                .trailing_zeros() as i32);
    if exp >= MIN_EXPONENT {
        fraction.binor_exp(exp)
    } else {
        fraction.binor_exp(MIN_EXPONENT) - ADJUST
    }
}
Checking if the value is not zero is faster, but LLVM still inserts an unnecessary extra conditional jump (`2,455 ns/iter (+/- 11) = 1629 MB/s`):
#[inline(always)]
fn closed_open01(self) -> $ty {
    let fraction = self & $FRACTION_MASK;
    let remaining: $uty = self >> $FRACTION_BITS;
    if remaining != 0 {
        let exp = remaining.trailing_zeros() as i32;
        fraction.binor_exp(-exp)
    } else {
        const MIN_EXPONENT: i32 = $FRACTION_BITS - $FLOAT_SIZE;
        let adjust: $ty = (0 as $uty).binor_exp(MIN_EXPONENT); // 2^MIN_EXPONENT
        fraction.binor_exp(MIN_EXPONENT) - adjust
    }
}
Anyway, a little optimization is still possible; the limit is probably something like `2,400 ns/iter = 1650 MB/s` (to compare with the benchmark above). I just somehow have to convince LLVM...
So would a good strategy be to generate fixed-precision `[0, 1)` by default, with some variants available as options?

* `[1, 2)` fixed precision
* `(0, 1)` fixed precision
* `[0, 1)` 32 / 64 bit precision (@pitdicker's code above)
* `[0, 1)` — @pitdicker's approach with extended precision?
* `[0, 1]` maximum precision (infinite bit-sequence rounded to representable values)?

If you're both happy with the rough plan, I guess we start by merging https://github.com/rust-lang-nursery/rand/pull/237 (or my reshuffle of it), then open PRs for each other change. @tspiteri @pitdicker?
Initially I was arguing that the random number in [0, 1) should have a fixed step (basically ε) all through the range; probably subconsciously influenced by the issue mentioned in, for example, https://docs.oracle.com/javase/8/docs/api/java/util/Random.html#nextFloat-- where the division could cause nonuniformity (BTW, the documented correct Java `nextFloat` has steps of ε/2). But looking at @pitdicker's approach that uses the extra bits, it is not using something equivalent to the division mentioned in the Java doc, which rounds to nearest; his algorithm is equivalent to division with truncation, where the issue disappears. So now I do not have reservations about using the extra bits any more; after all, they are free in the sense that the random generator has already generated them.
I like the `[1, 2)` range; it is a nice way to provide fixed precision with fixed steps for the cases where those properties are required, and it is provided in a way that does not need to be over-documented, as there is only one obvious way to generate random numbers in `[1, 2)`, that is with steps of ε. And including this is good for completeness; for example it would be incorrect to just use the `[0, 1)` range with extra precision and then add one and subtract one again to try to force equal steps, as that would cause the rounding nonuniformity mentioned in the linked Java doc.
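A hypothetical concrete case of that nonuniformity (my own numbers, not from the thread): adding and subtracting one rounds to nearest instead of truncating, so a high-precision value below ε can get pushed up to ε.

fn main() {
    let x = 0.75 * f32::EPSILON;  // representable only with the extra precision
    let forced = (x + 1.0) - 1.0; // 1 + 0.75·ε rounds to 1 + ε, not down to 1
    assert_eq!(forced, f32::EPSILON);
}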
And I also like that the `[0, 1]` interval with maximum precision is included, as I see this as generating a random number with a continuous uniform distribution with correct rounding.
In short, I like @dhardy's proposal.
> So would a good strategy be to generate fixed-precision `[0, 1)` by default, with some variants available as options?
>
> * `[1, 2)` fixed precision
I am against including this range, but do see some interesting uses. Two reasons, detailed below: it is not the only 'special' range, and we can do something similar (and better) in the `Range` distribution.
To be honest I don't see the use. Well, I can see some use, but I think there are more 'strange' ranges that could be interesting. And where do you stop?
**It is not the only 'special' or 'obvious' range**
As an example, the fastest way to generate a number in the range `[-1, 1)` is to make a float in the range `[2, 4)` and subtract `3.0`.
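A sketch of that trick, again for `f32` with my own constants:

fn minus_one_to_one(bits: u32) -> f32 {
    // exponent bits for [2, 4), 23 random mantissa bits, step 2ε
    let two_four = f32::from_bits(128 << 23 | (bits >> 9));
    two_four - 3.0 // exact subtraction, result in [-1, 1)
}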
**It is not a useful range for users; it only reveals an implementation detail**
The `[1, 2)` range is nice conceptually, but I can't imagine many users wanting to use it directly. You just about always will have to convert it to the range you need. But is `[1, 2)` somehow a more useful range for performance?
**We can do something similar (and better) in the `Range` distribution**
With fixed precision there is one subtraction involved to get the range down to `(0, 1)`. Converting it to the right range requires at most one addition and one multiply. Is LLVM able to combine the two additions? That would mean we have to do the addition first in the range code. Or could it even combine them and do a fused multiply–add on ARM? Then we would have to do the multiply first. Probably all not possible without some `fast-math` flag.
If we write a custom conversion in the `Range` code that first makes a float in the range `[1, 2)`, then multiplies it, and does a single addition, I think our range code is as optimal as it gets (for a generic function).
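A rough sketch of the shape such a `Range` conversion could take for `f32` (a guess, not rand's actual code; `scale` and `offset` would normally be precomputed when the range is constructed):

fn sample_range(bits: u32, low: f32, high: f32) -> f32 {
    let one_two = f32::from_bits(127 << 23 | (bits >> 9)); // [1, 2)
    let scale = high - low;
    let offset = low - scale;  // chosen so that 1.0 maps onto `low`
    one_two * scale + offset   // one multiply and one add per sample
}

Whether LLVM can fuse anything further is exactly the open question above.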
> * `(0, 1)` fixed precision
I would pick this one, `(0, 1)`, as the default. We figured out elsewhere that the guarantee to not return some value is more interesting than the guarantee to generate it. The chance to generate `0.0` with `[0, 1)` is just 1 in 2^53 (for `f64`). So for many uses it will never appear, and the guarantee is not very interesting. While the guarantee to not generate `0.0` is necessary for the correctness of some algorithms.
> * `[0, 1)` 32 / 64 bit precision (@pitdicker's code above)
I think the benefit is too small to make it the default, considering the performance is lower than with fixed precision.
> * `[0, 1)` — @pitdicker's approach with extended precision?
I don't think any more precision buys us much.
Compared to the fixed precision method, using 32/64 bits of precision will produce a number with more precision in about 50% of the values. The next step, almost infinite precision, will on average only once every 512 values produce something with 1 bit of extra precision (and only once in 4096 for `f64`).
The cost is that the number of rounds used from the RNG becomes variable, instead of this being a simple conversion function. But I don't think it will be much slower.
> * `[0, 1]` maximum precision (infinite bit-sequence rounded to representable values)?
Aren't these two basically the same, as far as the end user is concerned? I am curious what the performance of @tspiteri's code from https://github.com/dhardy/rand/issues/85#issuecomment-358974314 is, but I don't imagine it to be great, to be honest...
So in the end I would only pick two ranges:

* `(0, 1)` with fixed precision as the default
* `[0, 1)` with 32/64 bits of precision

Of all the possible other ranges I do think the `[-1, 1)` range is interesting (possibly with different closed/open bounds), because I encountered it in quite a few algorithms. Especially with 32/64 bit precision it becomes useful, because the numbers with higher precision are not concentrated near the start of the range, but near the middle. But probably best to keep that out of the discussion for now.
Just to keep this somewhere... I tried to find an alternative to counting zeros. A nice trick is to convert the bits that are not used for the fraction (`remaining`) from an integer to a float. This gives a float with the logarithm of the number encoded in its exponent. The exponent only needs to be adjusted to be negative to get the right value. It is not necessary to mask out the fraction of this number; just XOR-ing it with the random fraction bits is enough (instead of OR).
fn closed_open01(self) -> $ty {
    let fraction = self & $FRACTION_MASK;
    let remaining = self >> $FRACTION_BITS;
    if remaining != 0 {
        unsafe {
            let mut exp: $uty = mem::transmute(remaining as $ty);
            exp -= ($FLOAT_SIZE - $FRACTION_BITS) << $FRACTION_BITS;
            mem::transmute(fraction ^ exp)
        }
    } else {
        const MIN_EXPONENT: i32 = $FRACTION_BITS - $FLOAT_SIZE;
        let adjust: $ty = (0 as $uty).binor_exp(MIN_EXPONENT); // 2^MIN_EXPONENT
        fraction.binor_exp(MIN_EXPONENT) - adjust
    }
}
But in the end the performance is slightly worse than the zero counting method (`2,553 ns/iter (+/- 35) = 1566 MB/s`).
@pitdicker if we can get fast conversion to specified ranges via `Range` that would be awesome. It does sound a little complicated though — there is open vs closed (unless we just do open everywhere) as well as multiple possible precisions.
`(0, 1)` as a default sounds fine. It's a change from the current `[0, 1)` but not one that most users would notice.
0-1 with 32/64 bits of precision may be a useful compromise sometimes. Not sure.
Having a maximal precision 0-1 range would be useful sometimes I think — though since it is not quite trivial to convert to a high-precision `[-1, 1]` range (or another range spanning zero), support via `Range` would be nice.
Maybe in the end we should have two or three range types:

* `FixedPrecisionRange` — if this makes sense for arbitrary ranges
* `WordPrecisionRange` — i.e. roughly what your current code in my branch does, IIRC
* `MaximalPrecisionRange`
> I had a glance at dSFMT and it looks like it's not a "native FP generator" anyway, but works similar to some of the methods mentioned here, except that it only generates 52 random bits.
IIRC, dSFMT provides some `fn fill(&mut [f64])` function that is a lot faster than generating floats from 64-bit integers one-by-one. I'm not sure how strongly that is coupled to the implementation of the RNG.
> It does sound a little complicated though — there is open vs closed (unless we just do open everywhere) as well as multiple possible precisions.
I am leaning more and more towards not caring much about closed/open bounds for floating point ranges, because after many hours of trying I still can't get all the rounding errors out of the floating point range code (first try back in September? last try in January).
For some ranges it is even impossible to represent the size of the range, while the start and end can be exactly represented. In those cases there is no way to get the bounds 'right'. So for primitives like the `(0, 1)` range worrying about the exact bounds makes sense, but for arbitrary ranges I think we are already doing well if the result stays within the bounds :cry:.
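A hypothetical concrete case for `f32` (my example): both bounds are exactly representable, but the size of the range is not, so it already rounds before any sampling happens.

fn main() {
    let low = -1.0f32;
    let high = 1.0f32 + f32::EPSILON;
    let size = high - low; // the true size, 2 + ε, has no f32 representation
    assert_eq!(size, 2.0); // ties-to-even rounds the size down here
}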
For integer ranges I once had the thought that both the current range code and a less uniform one make sense. Now we lose some performance to checking the zone to sample from, while most ranges in a `u32` are so small that not checking the zone would have negligible bias. We could add a `RangeFast` distribution or something.
Then for both integer ranges and floating point ranges we would have a choice between high accuracy and best performance.
I am getting a little more time again, and I see your distributions PR has landed :tada:. Shall I try to work on this?
Did some tests, both 32/64 bits precision and maximum precision have the same performance. Let's just go with maximum then. Definitely better than having three choices.
Yes, I decided to merge that. I also completed a merge back into my master branch — phew, that was hard! So now is probably a good time to bring over (or redo) FP sampling / range code if you're interested.
I think perhaps we should keep the `[0, 1)` (i.e. half-open) range available, not necessarily as the default, but because it is widely used and quite useful if someone wishes to try reproducing results from other languages in Rust, and also because of patterns like `rng.gen() < p` and `let index = (rng.gen() * (len as f64)) as usize`. We may recommend users don't use those patterns, but they may still be used, e.g. in direct code translations.
> * `HalfOpen01` distribution (I think this alone should be enough for most uses where `Uniform` / `Range` are not sufficient)

What is left to do here? I don't think we need `HalfOpen01`, because it is equivalent to `Range::new(0., 1.)`, at least it should be after adding some `#[inline]` attributes.
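For reference, the equivalent with the current API (assuming rand 0.8, where the old `Range` is called `Uniform`):

use rand::distributions::{Distribution, Uniform};

fn main() {
    let mut rng = rand::thread_rng();
    let half_open = Uniform::new(0.0f64, 1.0); // the old Range::new(0., 1.)
    let x = half_open.sample(&mut rng);
    assert!((0.0..1.0).contains(&x));
}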
Yes, probably nothing. But let's leave this open at least until we have some high-precision range integrated.
> I've had a look at `FloatConversions` and `Uniform01`, `Closed01` and `Open01`. I'm not sure all the features should be there. Personally I would only provide the following two features, one from a generator perspective, and the other more from a distribution perspective:
>
> 1. A way to generate a random number similar to the `next_f32` or `next_f64` in rand 0.4, and which is now in `FloatConversions::open_closed01_fixed`. I would not provide any of the other five functions in `FloatConversions` and think that if this is not sufficient, probably what is required is something from the uniform distribution. I prefer the situation with rand 0.4 here, where `next_f32` and `next_f64` are part of the generator.
> 2. For the uniform distribution inside the `distributions` module, I would provide `Uniform01` only and remove `Closed01` and `Open01`, and for the implementation I would specify that it is equivalent to generating a floating-point number with arbitrarily large precision with a continuous uniform distribution in the range [0, 1], and rounding.
>
> Number 1 is I think the most common way to simply generate a random float, and it always produces m bits of randomness, where m is the number of mantissa bits. The output is a multiple of ε that is < 1, and every possible outcome has a probability of occurrence = ε. This is what I would expect from a generator. It is very similar to generating integers, except that the randomness is in the mantissa only.
>
> Number 2 is from the distribution point of view. With rounding to the nearest, both 0.0 and 1.0 are possible outcomes. This is functionally equivalent to producing 0.\*, where \* is an infinitely long random bit stream, and then rounding. If \* is an infinite stream of zeros, then the number is exactly zero, and if \* is an infinite stream of ones, then the number is exactly one.
>
> For this to be done correctly, more than one `u32` may need to be generated for a random `f32` if initially we get lots of zeros. But this is improbable, so it should not be considered a performance issue. For example, for `f32`, if the first generated `u32` does not start with eight zeros, it is sufficient: let's say the generated number is seven zeros followed by 25 ones. Then the output should be 0.0000_000 followed by 25 ones. This would produce an `f32` equal to (2−ε)×2^−8, but 2−ε only requires 24 of our ones. The 25th significant bit is used to see if we should round down or up, so if that bit is zero we round down, returning (2−ε)×2^−8, otherwise we round up, returning 2^−7. Note that we do not need to care about ties-to-even here: since \* is infinitely long, a tie has zero probability.
>
> For three or more `u32` numbers to be required, the first `u32` has to be exactly zero, and the second `u32` must start with eight zeros. So the probability that one `u32` is required is 1 − 2^−8, the probability that two `u32` values are required is 2^−8 − 2^−40, and the probability that three or more are required is 2^−40.
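For concreteness, a minimal sketch of what conversion 1 amounts to for `f32` (my own illustration matching the description above, not the rand 0.4 source):

fn conversion_1(bits: u32) -> f32 {
    // 23 random bits: every outcome is k·ε for k in 0..2^23, each with
    // probability ε, so the result lies in [0, 1) with a fixed step of ε
    (bits >> 9) as f32 * f32::EPSILON
}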