dhardy / rand


discussion on uniform float distributions #85

Open tspiteri opened 6 years ago

tspiteri commented 6 years ago

I've had a look at FloatConversions and Uniform01, Closed01 and Open01. I'm not sure all the features should be there. Personally I would only provide the following two features, one from a generator perspective, and the other more from a distribution perspective:

  1. A way to generate a random number similar to the next_f32 or next_f64 in rand 0.4, and which is now in FloatConversions::open_closed01_fixed. I would not provide any of the other five functions in FloatConversions and think that if this is not sufficient, probably what is required is something from the uniform distribution. I prefer the situation with rand 0.4 here, where next_f32 and next_f64 are part of the generator.

  2. For the uniform distribution inside the distributions module, I would provide Uniform01 only and remove Closed01 and Open01, and for the implementation I would specify that it is equivalent to generating a floating-point number with arbitrarily large precision with a continuous uniform distribution in the range [0,1], and rounding.

Number 1 is, I think, the most common way to simply generate a random float, and it always produces m bits of randomness, where m is the number of mantissa bits. The output is a multiple of ɛ that is < 1, and every possible outcome has probability ɛ. This is what I would expect from a generator. It is very similar to generating integers, except that the randomness is in the mantissa only.
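As a rough illustration (my own sketch, not the actual rand implementation), such a fixed-precision conversion for f32 could look like this:

// Option 1 sketch for f32: keep 23 random bits and scale by ɛ = 2^-23,
// giving a multiple of ɛ in [0, 1). Every outcome has probability ɛ.
fn fixed_precision_f32(bits: u32) -> f32 {
    let mantissa = bits >> 9;       // keep 23 of the 32 random bits
    mantissa as f32 * f32::EPSILON  // scale by ɛ = 2^-23; exact, result in [0, 1)
}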

Number 2 is from the distribution point of view. With rounding to the nearest, both 0.0 and 1.0 are possible outcomes. This is functionally equivalent to producing 0.*, where * is an infinitely long random bit stream, and then rounding. If * is an infinite stream of zeros, then the number is exactly zero, and if * is an infinite stream of ones, then the number is exactly one.

For this to be done correctly, more than one u32 may need to be generated for a random f32 if initially we get lots of zeros. But this is improbable so it should not be considered a performance issue. For example for f32, if the first generated u32 does not start with eight zeros, it is sufficient: let's say the generated number is seven zeros followed by 25 ones. Then the output should be 0.0000_000 followed by 25 ones. This would produce an f32 equal to (2−ɛ)×2^−8, but 2−ɛ only requires 24 of our ones. The 25th significant bit is used to see if we should round down or up, so if that bit is zero we round down returning (2−ɛ)×2^−8, otherwise we round up returning 2^−7. Note that we do not need to care about ties-to-even here: since * is infinitely long, a tie has zero probability.

For three or more u32 values to be required, the first u32 has to be exactly zero, and the second u32 must start with eight zeros. So the probability that one u32 suffices is 1−2^−8, the probability that exactly two u32 values are required is 2^−8−2^−40, and the probability that three or more are required is 2^−40.

pitdicker commented 6 years ago

I don't think number 2 captures how the functions in FloatConversions work. Without going into all the details: the functions with higher precision take the normal number of bits for the mantissa, and the remaining bits are used to fill the exponent by taking something similar to their logarithm. For the lowest exponent there is a small correction, because we can't generate all possible exponents a float supports.

So there is no rounding, no infinitely long stream and only one number from the RNG is used. The smallest delta between values is 1/2^32 (for f32), and the largest delta 1/2^24 (ɛ/2).

pitdicker commented 6 years ago

Note that we decided against keeping next_f32 and next_f64 in the Rng trait, the distributions will be the only interface to get random floating point numbers.

pitdicker commented 6 years ago

I forgot to say: thank you for studying this :smile:.

There was already an issue open about how and what to merge back into rand upstream, https://github.com/dhardy/rand/issues/81.

Back when I added https://github.com/dhardy/rand/pull/1 there surprisingly seemed to be no performance difference between the normal methods and the ones with extra precision. But there was a problem with the benchmarks, and I had a lot to learn (better now, but still true :smile:). Turns out they are slower, as expected.

Now I think the fixed precision variants are a better default. I agree Closed01 is not really worth keeping then; it doesn't do anything special, and could just as well be replaced with a range. The Open01 distribution is useful for all algorithms that don't work with 0.0, so we should keep that.

For the higher precision variants it is another matter, because it needs different strategies for Open01 and Closed01. Something to think about some more. I don't like the number of different functions we are getting this way...

tspiteri commented 6 years ago

(Assuming f32.) FloatConversions::closed_open01 generates 23 bits for the fraction, so ignoring the exponent the random number is any of the 2^23 possibilities from 1.0 to 2.0−2^−23. It also does some computation for the exponent. Basically it does this:

Method 2 is more general: ignoring rounding, it is functionally equivalent except for the last case, where it is a generalization. Note that the results would be different because of the implementation, but the outcomes would be equally probable except in the last case. Basically, ignoring rounding, for method 2 you get this:

tspiteri commented 6 years ago

For what it's worth, I have some code doing this, and I just cleaned it up a bit. It looks like this:

use std::mem;
use rand::Rng;

struct BitBuffer {
    bits: u32,
    len: u32,
}

impl BitBuffer {
    fn new() -> BitBuffer {
        BitBuffer { bits: 0, len: 0 }
    }

    fn fill<R: Rng + ?Sized>(&mut self, rng: &mut R) {
        self.bits = Rng::next_u32(rng);
        self.len = 32;
    }

    // gets n random bits
    fn get_bits<R: Rng + ?Sized>(&mut self, mut n: u32, rng: &mut R) -> u32 {
        assert!(n <= 32);
        if n == 0 {
            return 0;
        }
        if self.len == 0 {
            self.fill(rng);
        }
        let mut ans = 0;
        if n > self.len {
            n -= self.len;
            ans |= self.bits << n;
            self.fill(rng);
        };
        self.len -= n;
        ans |= self.bits >> self.len;
        self.bits &= !(!0 << self.len);
        ans
    }

    // finds the first significant bit, if available in first max_bits bits
    fn first_one<R: Rng + ?Sized>(&mut self, max_bits: u32, rng: &mut R) -> Option<u32> {
        let mut prev_leading = 0;
        loop {
            // count only leading zeros in available bits
            let add_to_leading = self.bits.leading_zeros() + self.len - 32;
            let leading = prev_leading + add_to_leading;
            // check if we have no ones in initial max_bits
            if leading >= max_bits {
                // skip consumed zeros
                self.len -= max_bits - prev_leading;
                return None;
            }
            // check if we have some significant bit
            if add_to_leading < self.len {
                // skip not just zeros, but also first significant bit
                self.len -= add_to_leading + 1;
                self.bits &= !(!0 << self.len);
                return Some(leading);
            }
            // we do not have enough bits yet
            prev_leading = leading;
            self.fill(rng);
        }
    }
}

// The random number generated is 0.*, where * is an infinitely long
// stream of random bits.
//
// If * starts with (BIAS - 1) zeros, the number is subnormal. For a
// subnormal number, we set the biased exponent to zero and the
// mantissa bits to the random bits after the leading (BIAS - 1)
// zeros.
//
// For a normal number, we set the biased exponent to (BIAS - 1 -
// leading zeros), then we skip the one bit that stopped the stream of
// leading zeros which will be encoded implicitly. Then we set the
// mantissa bits to the random bits after that first significant bit.
//
// For both subnormal and normal numbers, we have to check rounding.
// We are rounding to the nearest. A tie is not possible since * is an
// infinitely long stream of random bits, so we only need one more
// random bit to decide whether to round down or up.
pub fn uniform01_f32<R: Rng + ?Sized>(rng: &mut R) -> f32 {
    const BIAS: u32 = 127;
    const MANT_BITS: u32 = 23;

    let mut bits = BitBuffer::new();
    let mut u = match bits.first_one(BIAS - 1, rng) {
        None => 0,
        Some(first) => (BIAS - 1 - first) << MANT_BITS,
    };
    u |= bits.get_bits(MANT_BITS, rng);
    // Handle rounding. For rounding up, just add 1 to u, which always
    // moves to the next floating-point number above.
    if bits.get_bits(1, rng) == 1 {
        u += 1;
    }
    unsafe { mem::transmute(u) }
}
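A hypothetical usage sketch (the choice of RNG here is just an example):

let mut rng = rand::thread_rng();
let x = uniform01_f32(&mut rng); // x is in [0.0, 1.0]; both endpoints are possible but astronomically unlikely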

pitdicker commented 6 years ago

Good explanation. That is mostly how the source of the idea was described in https://readings.owlfolio.org/2007/generating-pseudorandom-floating-point-values/ (without a practical way to implement it though).

If we are only thinking in terms of Distributions I think your method 2 can work, requesting more random numbers when necessary. It should not even be much or any slower. Yet I think a conversion function that just uses all bits of one round of the RNG is nice; a function that can generate every possible floating point value in [0, 1) seems overkill. One reason I did not go that route initially is that for f64 it could require up to 16 u64s to pick the exponent, although that is extremely unlikely.

tspiteri commented 6 years ago

Requiring 16 u64 values is so unlikely the universe could well reach its heat death before that happens 😄. Unless the random number generator is just producing zeros, but I think that would only be useful in testing.

pitdicker commented 6 years ago

Found something interesting: suppose you want to generate a random f64 that can be any possible value in the range [0, 1). The exponent can go as low as -1024, and the mantissa is 52 bits. In total 1076 bits are needed from the RNG.

The chance to get a certain value depends on how close it is to zero, with the smallest values having a chance of 1 in 2^1076. To be able to generate such a number the RNG must have a period of at least 2^1076.

pitdicker commented 6 years ago

Didn't see your post before commenting.

For myself I work with the figure that today's hardware can generate about 2^30 values per second on a single processor. That means ~2^55 per year, and 2^64 takes about 500 CPU-years. Any conversion method that just takes a single u64 seems good enough to me.

tspiteri commented 6 years ago

Yeah, it is pretty unlikely to require two u64 values. The method I posted above only becomes really significant for low precisions, e.g. with half::f16.

dhardy commented 6 years ago

So, to summarise, we have:

  1. Generate values uniformly distributed between 0 and 1-ε with step ε (optionally offset)
  2. Generate values effectively from an infinite bit sequence rounded to representable values
  3. Generate values with as much precision as available from a single u32 / u64

Separate to that, we have five questions:

  1. Is the mean value close to 0.5? In practice as long as the difference is no greater than ε it probably doesn't matter.
  2. What is the probability of sampling 0? Mostly we only care whether this is possible.
  3. What is the probability of sampling 1? Again, mostly only care whether possible.
  4. How much precision is useful for numbers close to 0?
  5. Do we care at all about generating random FP numbers directly or will we always convert from integers? [So far I've seen very little interest in direct FP generation and only the dSFMT generator.]

Are these details likely to matter?

For many cases, the exact details are not important. In some ways, allowing sampling of 1 but never 0 makes more sense as a default than the status quo, but I'm uncomfortable both making this breaking change and bucking the norm.

Do we need [0, 1), (0, 1], [0, 1] and (0, 1)? No. Arguably (0, 1] alone should satisfy most use-cases. But I'm not sure which is the best sub-set.

Not really sure at this point, but benchmarks of various optimised implementations would be nice.

Edit: added questions 4 and 5 and a few details.

pitdicker commented 6 years ago

Of the variant "Generate values uniformly distributed between 0 and 1-ε with step ε (optionally offset)", the ranges [0, 1), (0, 1) and (0, 1] can all be generated with exactly the same performance.

Step 1 is always to make a float in the range [1, 2). Then subtract 1.0, 1.0-ε/2 or 1.0-ε respectively.
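For illustration, a minimal sketch of this for f32 (my own example, not the exact FloatConversions code):

// Make a float in [1, 2) by pairing 23 random fraction bits with the
// exponent bits of 1.0, then shift the result by the appropriate constant.
fn fixed_01_variants(bits: u32) -> (f32, f32, f32) {
    let eps = f32::EPSILON; // 2^-23
    let one_to_two = f32::from_bits((bits >> 9) | 0x3f80_0000); // [1, 2), step ɛ
    let closed_open = one_to_two - 1.0;               // [0, 1)
    let open_open   = one_to_two - (1.0 - eps / 2.0); // (0, 1)
    let open_closed = one_to_two - (1.0 - eps);       // (0, 1]
    (closed_open, open_open, open_closed)
}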

dhardy commented 6 years ago

Which implies it would be nice to move the next_f32 impl out to a distribution/Rand implementation where all variants can be implemented uniformly, and drop Closed01.

Note: I had a glance at dSFMT and it looks like it's not a "native FP generator" anyway, but works similar to some of the methods mentioned here, except that it only generates 52 random bits.

pitdicker commented 6 years ago

Which implies it would be nice to move the next_f32 impl out to a distribution/Rand implementation where all variants can be implemented uniformly, and drop Closed01.

👍 from me.

A benchmark (warning: tampered with 😄; I changed the names):

test gen_u32_xorshift       ... bench:       1,371 ns/iter (+/- 4) = 2917 MB/s
test fixed_01_f32           ... bench:       1,599 ns/iter (+/- 2) = 2501 MB/s
test open_closed_01_f32     ... bench:       2,629 ns/iter (+/- 8) = 1521 MB/s
test open_open_01_f32       ... bench:       2,761 ns/iter (+/- 15) = 1448 MB/s
test closed_open_01_f32     ... bench:       2,521 ns/iter (+/- 14) = 1586 MB/s

Xorshift is included for reference; the time the conversion takes is the difference from the Xorshift line. fixed is for the three ranges that generate a float with ε precision. The other three are my higher precision attempts.

The three higher-precision variants are ±3× slower than the fixed precision variants. Generating f32s with a generator like Xorshift is 15% vs. 50% slower than generating integers, for the fixed and higher precision variants respectively. Using an Xorshift variant that works on u64s to generate f64s should have similar results.

The large difference in performance is (was) a bit surprising to me. Besides movs etc. fixed takes just 3 operations. closed_open takes 7 operations (some of which can be done in parallel) and 1 branch. But if I look at the assembly of closed_open01 it contains two more branches than expected. Not sure what is going on yet, looks like trailing_zeros does not work optimally...

pitdicker commented 6 years ago

Hmm. Turns out the extra branch and jump are inserted because counting the number of trailing zeros of 0 is undefined.

Manually setting a bit makes LLVM optimize out the branch (2,505 ns/iter (+/- 13) = 1596 MB/s):

            #[inline(always)]
            fn closed_open01(self) -> $ty {
                const MIN_EXPONENT: i32 = $FRACTION_BITS - $FLOAT_SIZE;
                #[allow(non_snake_case)]
                let ADJUST: $ty = (0 as $uty).binor_exp(MIN_EXPONENT);
                                  // 2^MIN_EXPONENT

                // Use the last 23 bits to fill the fraction
                let fraction = self & $FRACTION_MASK;
                let remaining = self >> $FRACTION_BITS;
                // Use the remaining (first) 9 bits for the exponent
                let exp = - ((remaining | (1 << $FRACTION_BITS))
                            .trailing_zeros() as i32);
                if exp >= MIN_EXPONENT {
                    fraction.binor_exp(exp)
                } else {
                    fraction.binor_exp(MIN_EXPONENT) - ADJUST
                }
            }

Checking if the value is not zero is faster, but LLVM still inserts an unnecessary extra conditional jump (2,455 ns/iter (+/- 11) = 1629 MB/s):

            #[inline(always)]
            fn closed_open01(self) -> $ty {
                let fraction = self & $FRACTION_MASK;
                let remaining: $uty = self >> $FRACTION_BITS;
                if remaining != 0 {
                    let exp = remaining.trailing_zeros() as i32;
                    fraction.binor_exp(-exp)
                } else {
                    const MIN_EXPONENT: i32 = $FRACTION_BITS - $FLOAT_SIZE;
                    let adjust: $ty = (0 as $uty).binor_exp(MIN_EXPONENT);
                                      // 2^MIN_EXPONENT
                    fraction.binor_exp(MIN_EXPONENT) - adjust
                }
            }

Anyway, a little optimization is still possible; the limit is probably something like 2,400 ns/iter = 1650 MB/s (to compare with the benchmark above). Just somehow have to convince LLVM...

dhardy commented 6 years ago

So would a good strategy be to generate fixed-precision [0, 1) by default, with some variants available as options?

If you're both happy with the rough plan, I guess we start by merging https://github.com/rust-lang-nursery/rand/pull/237 (or my reshuffle of it), then open PRs for each other change. @tspiteri @pitdicker?

tspiteri commented 6 years ago

Initially I was arguing that the random number in [0,1) should have a fixed step (basically ɛ) throughout the range; probably I was subconsciously influenced by the issue mentioned in, for example, https://docs.oracle.com/javase/8/docs/api/java/util/Random.html#nextFloat-- where the division could cause nonuniformity (BTW, the documented correct Java nextFloat has steps of ɛ/2). But looking at @pitdicker's approach that uses the extra bits, it is not using something equivalent to the division mentioned in the Java doc, which rounds to nearest; his algorithm is equivalent to division with truncation, where the issue disappears. So now I no longer have reservations about using the extra bits; after all, they are free in the sense that the random generator has already generated them.

I like the [1,2) range; it is a nice way to provide fixed precision with fixed steps for the cases where those properties are required, and it can be provided without over-documentation, as there is only one obvious way to generate random numbers in [1,2): with steps of ɛ. Including it is also good for completeness; for example, it would be incorrect to just use the [0,1) range with extra precision and then add one and subtract one again to try to force equal steps, as that would cause the rounding nonuniformity mentioned in the linked Java doc.

And I also like that the [0,1] interval with maximum precision is included, as I see this as generating a random number with a continuous uniform distribution with correct rounding.

In short, I like @dhardy's proposal.

pitdicker commented 6 years ago

So would a good strategy be to generate fixed-precision [0, 1) by default, with some variants available as options?

  • [1, 2) fixed precision

I am against including this range, but do see some interesting uses. Two reasons:

To be honest I don't see the use. Well, I can see some use, but think there are more 'strange' ranges that could be interesting. And where do you stop?

It is not the only 'special' or 'obvious' range. As an example, the fastest way to generate a number in the range [-1, 1) is to make a float in the range [2, 4) and subtract 3.0 (see the sketch at the end of this comment).

It is not a useful range for users, it only reveals an implementation detail. The [1, 2) range is nice conceptually, but I can't imagine many users wanting to use it directly. You just about always will have to convert it to the range you need. But is [1, 2) somehow a more useful range for performance?

We can do something similar (and better) in the Range distribution. With fixed precision there is one subtraction involved to get the range down to (0, 1). To convert it to the right range requires at most one addition and one multiply. Is LLVM able to combine the two additions? That would mean we have to do the addition first in the range code. Or could it even combine them and do a fused multiply–add on ARM? Then we would have to do the multiply first. Probably all not possible without some fast-math flag.

If we write a custom conversion in the Range code that first makes a float in the range [1, 2), then multiplies it, and does a single addition, I think our range code is as optimal as it gets (for a generic function).

  • (0, 1) fixed precision

I would pick this one, (0, 1), as a default. We figured out elsewhere that the guarantee to not return some value is more interesting than the guarantee to generate it. The chance to generate 0.0 with [0, 1) is just 1 in 2^53 (for f64), so for many uses it will never appear and the guarantee is not very interesting, while the guarantee to not generate 0.0 is necessary for the correctness of some algorithms.

  • [0, 1) 32 / 64 bit precision (@pitdicker's code above)

I think the benefit is too small to make it the default, considering the performance is lower than with fixed precision.

  • [0, 1) — @pitdicker's approach with extended precision?

I don't think any more precision buys us much. Compared to the fixed precision method, using 32/64 bits of precision will produce a number with more precision in about 50% of the values. The next step, almost infinite precision, will on average only once every 512 values produce something with 1 bit of extra precision (and only once in 4096 for f64).

The cost is that the number of rounds used from the RNG becomes variable, instead of this being a simple conversion function. But I don't think it will be much slower.

  • [0, 1] maximum precision (infinite bit-sequence rounded to representable values)?

Aren't these two basically the same, as far as the end user is concerned? I am curious what the performance from @tspiteri's code from https://github.com/dhardy/rand/issues/85#issuecomment-358974314 is, but don't imagine it to be great to be honest...

So in the end I would only pick two ranges:

Of all the possible other ranges I do think the [-1, 1) range is interesting (possibly with different closed/open bounds) because I encountered it in quite a few algorithms. Especially with 32/64 bit precision it becomes useful, because the numbers with higher precision are not concentrated near the start of the range, but near the middle. But probably best to keep that out of the discussion for now.
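As an aside, a minimal sketch of the [2, 4) trick mentioned above (illustrative only, not existing rand code):

// 23 random fraction bits with the exponent bits of 2.0 give a float in
// [2, 4) with step 2ɛ; subtracting 3.0 is exact (Sterbenz lemma) and
// yields a value in [-1, 1).
fn minus_one_one_f32(bits: u32) -> f32 {
    let two_to_four = f32::from_bits((bits >> 9) | 0x4000_0000); // [2, 4)
    two_to_four - 3.0
}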

pitdicker commented 6 years ago

Just to keep this somewhere... I tried to find an alternative to counting zeros. A nice trick is to convert the bits that are not used for the fraction (remaining) from an integer to a float. This gives a float with the logarithm of the number encoded in the exponent. The exponent only needs to be adjusted to be negative to get the right value. It is not necessary to mask out the fraction of this number; just XOR-ing it with the random fraction bits is enough (instead of OR).

fn closed_open01(self) -> $ty {
    let fraction = self & $FRACTION_MASK;
    let remaining = self >> $FRACTION_BITS;
    if remaining != 0 {
        unsafe {
            let mut exp: $uty = mem::transmute(remaining as $ty);
            exp -= ($FLOAT_SIZE - $FRACTION_BITS) << $FRACTION_BITS;
            mem::transmute(fraction ^ exp)
        }
    } else {
        const MIN_EXPONENT: i32 = $FRACTION_BITS - $FLOAT_SIZE;
        let adjust: $ty = (0 as $uty).binor_exp(MIN_EXPONENT);
                          // 2^MIN_EXPONENT
        fraction.binor_exp(MIN_EXPONENT) - adjust
    }
}

But in the end the performance is slightly less than the zero counting method (2,553 ns/iter (+/- 35) = 1566 MB/s).

dhardy commented 6 years ago

@pitdicker if we can get fast conversion to specified ranges via Range that would be awesome. It does sound a little complicated though — there is open vs closed (unless we just do open everywhere) as well as multiple possible precisions.

(0, 1) as a default sounds fine. It's a change from the current [0, 1) but not one that most users would notice.

0-1 with 32/64 bits of precision may be a useful compromise sometimes. Not sure.

Having a maximal precision 0-1 range would be useful sometimes I think — though since it is not quite trivial to convert to a high-precision [-1, 1] range (or other range spanning zero), support via Range would be nice.

Maybe in the end we should have two or three range types:

vks commented 6 years ago

I had a glance at dSFMT and it looks like it's not a "native FP generator" anyway, but works similar to some of the methods mentioned here, except that it only generates 52 random bits.

IIRC, dSFMT provides some fn fill(&mut [f64]) function that is a lot faster than generating floats from 64 bit integers one-by-one. I'm not sure how strongly that is coupled to the implementation of the RNG.

pitdicker commented 6 years ago

It does sound a little complicated though — there is open vs closed (unless we just do open everywhere) as well as multiple possible precisions.

I am more and more leaning towards not caring much for closed/open ranges with floating point ranges, because after many hours of trying I still can't get all the rounding errors out of the floating point range code (first try back in September? last try in January).

For some ranges it is even impossible to represent the size of the range, while the start and end can be exactly represented. In those cases there is no way to get the bounds 'right'. So for primitives like the (0, 1) range worrying about the exact bounds makes sense, but for arbitrary ranges I think we are already doing well if the result stays within the bounds :cry:.
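A concrete example of such a range (my illustration): both bounds are ordinary finite f64 values, but the size of the range is not representable.

fn range_size_overflow() -> f64 {
    let low: f64 = -1.5e308;
    let high: f64 = 1.5e308;
    high - low // +inf, even though low and high are exactly representable
}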

For integer ranges I once had the thought that both the current range code and a less uniform one make sense. Right now we lose some performance to checking the zone to sample from, while most ranges in a u32 are so small that not checking the zone would introduce negligible bias. We could add a RangeFast distribution or something.

Then for both integer ranges and floating point ranges we would have a choice between high accuracy and best performance.
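One possible shape of such a biased-but-fast sampler (a rough sketch; RangeFast is only a hypothetical name here):

// A widening multiply maps a full-range u32 onto [0, range) without a
// rejection zone; the bias per output value is at most range / 2^32.
fn sample_fast(low: u32, range: u32, x: u32) -> u32 {
    low + ((x as u64 * range as u64) >> 32) as u32
}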

I am getting a little more time again, and see your distribution PR has landed :tada:. Shall I try to work on this?

pitdicker commented 6 years ago

Did some tests: both 32/64 bit precision and maximum precision have the same performance. Let's just go with maximum then. Definitely better than having three choices.

dhardy commented 6 years ago

Yes, I decided to merge that. I also completed a merge back into my master branch — phew, that was hard! So now is probably a good time to bring over (or redo) FP sampling / range code if you're interested.

dhardy commented 6 years ago

I think perhaps we should keep the [0, 1) (i.e. half-open) range available, not necessarily as the default, but because it is widely used and quite useful if someone wishes to try reproducing results from other languages in Rust, and also because of patterns like rng.gen() < p and let index = (rng.gen() * (len as f64)) as usize. We may recommend users don't use those patterns, but they may still be used, e.g. in direct code translations.

vks commented 6 years ago

What is left to do here? I don't think we need HalfOpen01, because it is equivalent to Range::new(0., 1.), at least it should be after adding some #[inline] attributes.

dhardy commented 6 years ago

Yes, probably nothing. But let's leave this open at least until we have some high-precision range integrated.