dhardy / rand

http://doc.rust-lang.org/rand

Shootout: small, fast PRNGs #52

Open dhardy opened 6 years ago

dhardy commented 6 years ago

We've already had a lot of discussion on this. Let's summarise the algorithms here.

This is about small, fast PRNGs. Speed, size and performance in tests like PractRand and TestU01 is of interest here; cryptographic quality is not.


Edit: see @pitdicker's work in #60.

dhardy commented 6 years ago

but if we want to support u64 output how do we do an XSH based version without u128 support in stable rust?

We already have an i128_support feature flag; I'd suggest putting code behind that flag and waiting until https://github.com/rust-lang/rust/issues/35118 is closed (in the meantime it could still be the default generator when that flag is enabled). If you roll your own assembly, the code won't be available on all platforms anyway, plus it seems like a waste of effort.

Edit: from the discussion on that issue I'm unclear if 128-bit support will ever be fully implemented for all platforms, so we may need another generator anyway as an alternative. I'm okay with WeakRng using a different generator on some platforms if we disable seeding support. The named version (PCG64_*) could even stay behind a feature gate with documentation of where it's available.

Edit2: sorry, sounds more like i128 support should be available everywhere eventually, albeit emulated on some platforms. That still means we may want an alternative u64 generator available.

Lokathor commented 6 years ago

@dhardy yeah the trouble is that to support output of width foo using the XSH based PCG permutation you need to have math for numbers of width 2*foo. So the u128 support isn't needed for the optional next_u128 method behind a feature flag, it's needed for the required next_u64 method that's part of the base rand_core setup. The PCG that I wrote up and linked to simply uses an alternate permutation to achieve next_u64 without use of feature flags (the alternative permutation used can give output as big as its own state, but it's notably slower when measured per-call).

@Ichoran There's a lot of ways to assemble a PCG, that's part of the fun, and I can't comment on a particular generator that hasn't been written yet, but I strongly suspect that your theoretical generator with more bits devoted to state and then using a fixed inc value will either have to:

I'll take the raw speed myself. That's why the officially suggested "minimal generator" is a 64state/32output PCG variant with inc selection like I'm using. It's a good middle ground between speed, total size of the generator, and having selectable streams.

If you really wanted to you could even have a variant where inc is selected but it gets auto-bumped whenever the state transition lands on zero, which would give you an effective period of something like 2^127 or so (total generator space of still only two u64 / 16 bytes). That's probably the simplest extension scheme available, and plenty of generator period.

Note also that you don't have to pick between only a fixed add or an inc value sized to your state size. If you're really concerned about the total space per generator you can use a smaller type for the inc value to save on bytes. E.g. a u8 for inc would give 2^7 streams and use only one byte.
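
The auto-bump idea above can be sketched roughly like this (hypothetical `Pcg32Bump` type; the multiplier is the usual 64-bit LCG constant, everything else is an illustrative assumption):

```rust
struct Pcg32Bump {
    state: u64,
    inc: u64, // always odd; could be stored narrower (e.g. u8) to save bytes
}

impl Pcg32Bump {
    fn new(seed: u64, stream: u64) -> Pcg32Bump {
        // Map the stream index to a distinct odd increment.
        Pcg32Bump { state: seed, inc: (stream << 1) | 1 }
    }

    fn step(&mut self) {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(self.inc);
        if self.state == 0 {
            // Landed on zero: splice over into the next stream.
            self.inc = self.inc.wrapping_add(2);
        }
    }
}

fn main() {
    let mut rng = Pcg32Bump::new(42, 7);
    for _ in 0..1000 {
        rng.step();
        // inc stays odd, so each individual stream remains full-period.
        assert_eq!(rng.inc % 2, 1);
    }
}
```

Since each of the 2^63 odd increments contributes a full 2^64-state cycle before the chain wraps, the effective period works out to roughly 2^127 while the generator still fits in two u64 words, as claimed above.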

Ichoran commented 6 years ago

@Lokathor - It would be good to benchmark the automatic stream switching, because such a generator would have considerably better quality when you just want a lot of numbers. I agree that it should work. Also, 64_64 is not recommended, yet consuming 64 bit numbers is a huge use case for e.g. scientific computing (where you may want a lot of random f64s).

vks commented 6 years ago

Rely on using a state extension scheme such as the one discussed in the PCG paper, which will also obviously be slower just because you're doing more operations (even if they are all 64-bit operations).

This is not true in general due to instruction-level parallelism. For instance, xorshift1024 is faster than xorshift128.

Did anyone check that neighboring streams are uncorrelated? This would not be caught by the standard test suites. If they are correlated, it can be dangerous to use them in parallel simulations. I would prefer the approach proposed by Vigna: skip 2^64 values instead of picking a new stream. This does not have the problem of streams because it's just using the tested generator instead of using a slightly different one.


dhardy commented 6 years ago

@Lokathor yes, I understand this 64 bit generator is dependent on 128-bit computation. That's why I said we need an alternative available. weak_rng is not required to produce the same results on different platforms (in fact it never will, due to not being seedable).

Lokathor commented 6 years ago

@vks I was referring to the specific extension scheme described in the PCG paper. It has a data dependency between each sub-state advance. If you go with a Xoroshiro style and just use a cycling selector and an array of states that would also work I guess.

As to your other concern, I will let the expert speak: http://www.pcg-random.org/posts/critiquing-pcg-streams.html

Lokathor commented 6 years ago

@dhardy the generator variant Ichoran mentioned, the PCG XSL RR 64/64 (EXT 2), would be entirely suitable for 64 bit generation in terms of both quality and period. The only reason that 64/64 is normally unsuitable is that too small a period forces each u64 to appear only once. With the extension effect applied to give a bigger period it's back to being very nice (and no 128 bit support required).

pitdicker commented 6 years ago

Wow, lots of ideas here!

@Ichoran

Slow 128-bit math on 32 bit platforms is an issue. If we can't get around that, we would need something that would be called PCG XSL RR 64/64 (EXT 2), which ought to be around 20% slower than the 128-bit version.

I don't believe 'PCG XSL RR 64/64 (EXT 2)' is a thing. The XSL part uses some of the bits to permute 32 of the other bits. Only the 'RXS M XS' variants are designed to output the same number of bits as the state size. That one uses multiple passes that each take a few different bits to permute the others.

I wonder what happens when you combine the RXS M XS variant with the state extension. In the PCG paper chapter 3.1 there is a good explanation why the RNG should have more bits of state than bits that it outputs. This makes the RXS M XS variants normally not recommended, and I wonder if the problem remains with the extended state.

@Lokathor

Hmm, hold on a minute. You say that "The RXS M variants are definitely the fastest of all the permutations", but in my (limited I admit) benchmarking the rxs_m_xs_64_64 version was around 30% slower than xsh_rr_64_32.

I was measuring in terms of calls / second.

I was comparing RXS M 32/32 with XSH 64/32, which both generate u32s, and RXS M 64/64 with XSH 128/64, which both generate u64s. Otherwise it did not seem fair to me :smile:.

@Ichoran

That looks good except that the use of an extra 8 bytes just to store inc makes the generator considerably less attractive: the period is as short as a generator of half the size, but for cases where very many random number streams need to be kept track of, the needed state is doubled.

We basically need to store at least 128 bits of state, or state + stream for easy initialization. A period of 2^64 combined with streams can be enough for just about every use. See these two comments for some of the math.

@Lokathor

If you really wanted to you could even have a variant where inc is selected but it gets auto-bumped whenever the state transition lands on zero, which would give you an effective period of something like 2^127 or so (total generator space of still only two u64 / 16 bytes). That's probably the simplest extension scheme available, and plenty of generator period.

If I understand what you are saying, you want to increment the stream value every time a full period of the RNG has passed. I don't think this really works, for a few reasons. Generating all numbers in a period, 2^64, is a huge amount, taking more than a century on my PC. That is not really why a large period matters. Doing an extra comparison in a function that has only 7 operations will surely slow it down. And to detect that the period has completed you need to compare against the seed, which will increase the state size.

@vks

I would prefer the approach proposed by Vigna: skip 2^64 values instead of picking a new stream.

This has one big problem that makes it almost unusable. You need to keep the state you jump from, to create the new stream, somewhere shared. Imagine one RNG is split in two for two threads using a fixed jump. One RNG will remain the same, and the other will be 2^64 rounds further. Now imagine both threads want to split their RNG again using fixed jumps. There are now 4 RNGs, but two of them are the same. Trying to solve this problem using synchronisation or elaborate schemes is harder and slower than just initializing a new RNG.

Ichoran commented 6 years ago

@pitdicker - You're right, of course, that I meant RXS M XS 64 (EXT 2). I don't think there's any reason it wouldn't work but it might require a more computationally expensive extension than the default. In any case, I think you make a good case that a period of 2^64 with 2^62ish streams is plenty. This leaves two candidates: RXS M XS 64 with streams, and XSH RS 64/32 (EXT 2) without. Since the former is faster on 64 bit architectures and already implemented, I'd go with that.

There is one usability wrinkle with regard to streams, though: right now, streams 2*n and 2*n+1 are identical. If people are going to futz with the stream counter directly, this could lead to problems (e.g. "increment by one" is the most natural way to get a new stream, but it doesn't work half the time). On the other hand, I don't think we can afford the computation to scramble the stream index for every random number generated.
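
The wrinkle above is easy to demonstrate with a minimal PCG XSH RR 64/32 sketch (hypothetical `Pcg32` type; the LCG multiplier is the standard constant). If stream selection only forces the increment odd with `inc = stream | 1`, then streams 2n and 2n+1 collapse into the same generator:

```rust
struct Pcg32 {
    state: u64,
    inc: u64,
}

impl Pcg32 {
    fn new(seed: u64, stream: u64) -> Pcg32 {
        // Naive stream handling: only force the increment odd.
        Pcg32 { state: seed, inc: stream | 1 }
    }

    fn next_u32(&mut self) -> u32 {
        let old = self.state;
        self.state = old
            .wrapping_mul(6364136223846793005)
            .wrapping_add(self.inc);
        // XSH RR output permutation
        let xsh = (((old >> 18) ^ old) >> 27) as u32;
        xsh.rotate_right((old >> 59) as u32)
    }
}

fn main() {
    // Streams 2 and 3 both map to inc = 3, so "incrementing the stream
    // by one" did nothing at all: the sequences are identical.
    let mut a = Pcg32::new(42, 2);
    let mut b = Pcg32::new(42, 3);
    for _ in 0..8 {
        assert_eq!(a.next_u32(), b.next_u32());
    }
}
```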

Lokathor commented 6 years ago

@pitdicker I don't actually think that a PCG with automatic stream jumping is a good plan (or I would use it myself), but it doesn't require that you store the initial state. Instead, you jump streams when your state lands on 0 after the state transition mult+add computation. This is very similar to the extension scheme suggested in the PCG paper. Assume that your generator has two streams (A and B) and that you start in stream A at some point "past" the 0 mark which we'll call P. So you use up all the rest of stream A, jump to B, use up all of B, then jump back into A and use up the beginning of the stream to point P. You've completed a perfect loop, despite not storing the value of P for stream-jump comparison. Of course, as you say, with a state value that's 64 bits, generating one number per nanosecond, that's still over 500 years.

The other simple extension scheme is quite possibly better for speed, but doesn't guarantee k-dimensional equidistribution like the above scheme does: a simple array of state values, and a counter that cycles through array positions of the current state to be used. We might call this the "xoroshiro" extension scheme.

@Ichoran Regarding streams: the "futzing" is just inc|1, which is cheap but not free. If you want to avoid paying even that much of a cost all you do is make the inc value get the |1 operation applied to it ahead of time whenever it's being set (either at generator creation time or as part of a "setter" if the API allows that). You also want to provide a split operation as part of the generator that does the right thing in terms of adjusting the inc value (wrapping add some big, even number).

Here's the catch: either the generator you call split on stays in its own stream and produces a generator in the "next" stream, in which case repeated splits have to be called on the most recently produced generator; or the generator you call split on moves itself into the new stream and hands out a generator in its old stream, which means you can keep calling split on the same generator over and over and get the right results. If you look back at the full example code I linked above you'll see that I went with the former design, but that's because I liked having inc fixed after generator creation. If you didn't want to stick to that as an invariant (and it's not an important invariant at all), then having the generator you call split on shift its stream would probably be the better scheme.

vks commented 6 years ago

@pitdicker I don't think it's such a big problem. In massively parallel calculations you often have a constant number of threads, so you just initialize your RNGs in the beginning. And don't you have the same problem with streams? You can't just pick them randomly because of the birthday paradox.

Ichoran commented 6 years ago

@Lokathor - The problem is not primarily that inc|1 is so expensive, it's that if inc == 2 and someone thinks, "Oh, I'll advance to the next stream, rng.inc += 1;!", they will have not done anything at all.

I think it's very important to make it hard for this to happen. Either with more math, e.g. (inc << 1) + 1, or by making it apparent that you need to use a next_stream function. I'd prefer the latter, as it allows a slightly more robust randomization scheme, and the generator may run a bit faster.
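
A sketch of the `(inc << 1) + 1` mapping suggested above (same hypothetical `Pcg32` shape as before; standard LCG multiplier): every stream index now gets a distinct odd increment, so adjacent indices no longer collide.

```rust
struct Pcg32 {
    state: u64,
    inc: u64,
}

impl Pcg32 {
    fn new(seed: u64, stream: u64) -> Pcg32 {
        // Every distinct stream index yields a distinct odd increment.
        Pcg32 { state: seed, inc: (stream << 1) | 1 }
    }

    fn next_u32(&mut self) -> u32 {
        let old = self.state;
        self.state = old
            .wrapping_mul(6364136223846793005)
            .wrapping_add(self.inc);
        // XSH RR output permutation
        let xsh = (((old >> 18) ^ old) >> 27) as u32;
        xsh.rotate_right((old >> 59) as u32)
    }
}

fn main() {
    let mut a = Pcg32::new(42, 2); // inc = 5
    let mut b = Pcg32::new(42, 3); // inc = 7
    a.next_u32(); // the first output depends only on the shared seed
    b.next_u32();
    // After one step, the different increments have driven the internal
    // states apart, so the streams are genuinely distinct.
    assert_ne!(a.state, b.state);
}
```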

Lokathor commented 6 years ago

Yeah, exposing inc as a publicly editable field is a really bad plan. That's why I don't do it.

Ultimately, we have to consider what the MultiStreamRng trait looks like. Or whatever other dumb name we give the trait for generators where you can pick a stream.

Ichoran commented 6 years ago

@Lokathor - I just looked at your implementation, and while you don't expose inc directly, you only do inc | 1 in new, and you also expose the existing value of inc via a function. So it doesn't take very much for someone to decide to generate a family of streams and accidentally end up with half of them being duplicates.

This is easy to fix, though, just by doing a little bit-mixing on the input.

pitdicker commented 6 years ago

@vks

And don't you have the same problem with streams? You can't just pick them randomly because of the birthday paradox.

I thought so too, but it is not as bad as it seems. A quote from myself:

Let's take 2^48 as an upper limit for the expected number of used results. A period of only 2^64 is just about enough to have a chance of 1 in 2000 for one new RNG to end up within the stream of one other RNG. If the RNG has 2^63 possible streams like PCG, 2^27 initializations are possible before the stream has a chance of 1 in 1000 of being the same.

Combined, this means it takes 2^27 initializations to get a chance of about 1 in a million that part of a window of 2^48 results is reused. Seems good enough to me.

I really tried to get a scheme using Xorshift jumps to work. In our conversations I may sound negative, but I like the Xorshift variants. I have studied Vigna's papers especially several times, and played with the code supporting them.

I wrote a jumping function, and tried calculating a custom jump polynomial. A fixed jump takes as many rounds of the RNG as it has state bits. Together with bookkeeping, this makes a jump as slow as requesting new bytes from OsRng.

I also experimented with variable jumps. For now I calculated the jump polynomial by hand in Excel :-(. A variable jump is at least several times slower than a fixed jump.

Another idea was to use variable jumps that are multiples of 2^50. Picking them at random gives a similar chance of duplicate streams as PCG streams. But this improves on the birthday problem only a little, while taking much more time.

In the end, using jumps to make the birthday problem less of an issue when initializing RNGs did not really work out, unless you have a single 'mother' RNG that is jumped every time a new 'child' RNG is split off. And even then it is slow.

Lokathor commented 6 years ago

@Ichoran I admit that my code is not fully idiot-proof because it's mostly intended for only me to ever use, but I do already provide split and split_many as a way to guide users in the correct direction in terms of stream selection. I also document over and over that the inc value passed must be odd, and that if it's not it gets bumped up.

One thing, though: MultiStreamRng should have stream count and stream selection specified by u64, not usize. The local machine's pointer size has no effect on how many streams a generator supports, so it's best not to fool people into thinking it's somehow related.

@pitdicker I forget my birthday problem formula exactly, but your math seems wrong just because its logic is wrong. The period of the PCG used has absolutely nothing to do with stream selection overlap. It's not the case that the inc shifts you "forward" by some amount within a single sequence of numbers that the generator as a whole always uses, each inc value produces a fully distinct sequence of numbers. "State 0 Stream 3" is not the same as "State 0+2steps Stream 1", so you can't compare them for your birthday calculation.

pitdicker commented 6 years ago

@Lokathor That is just a very short conclusion I pasted there. But I calculated the chance to end up in a stream that was already used before. That result gets multiplied by the chance to end up in the same 'window' of used results, from the period of that stream.

Lokathor commented 6 years ago

Ah, my mistake then.

vks commented 6 years ago

@Lokathor

If the RNG has 2^63 possible streams like PCG, 2^27 initializations are possible before the stream has a chance of 1 in 1000 of being the same.

I'm getting this result as well with the approximate solution of the birthday problem:

P(n, d) = 1 - exp(-n*(n-1)/(2*d))
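
As a quick sanity check of that approximation with the numbers quoted earlier (n = 2^27 initializations, d = 2^63 streams):

```rust
fn main() {
    // P(n, d) = 1 - exp(-n*(n-1) / (2*d))
    let n = (1u64 << 27) as f64;
    let d = (1u64 << 63) as f64;
    let p = 1.0 - (-n * (n - 1.0) / (2.0 * d)).exp();
    // Roughly 2^-10, i.e. about 1 in 1024 -- matching the
    // "1 in 1000" figure quoted above.
    assert!(p > 1.0 / 2000.0 && p < 1.0 / 500.0);
    println!("collision probability: {:e}", p);
}
```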
pitdicker commented 6 years ago

@Lokathor

Defaulting to the full next_u64 process and then throwing away half the bits is probably a bad choice if we can make next_u32 faster by using an alternate permutation. The 32-bit stream of an Rng already isn't going to match the 64-bit stream, so we might as well make it go as quick as we can.

I just tried this out for the XSL RR 128/64 (MCG) variant. The advantage of a custom permutation is that the truncation to u32 can happen earlier. Because on x86_64 64-bit operations are about as fast as 32-bit, it did not change the benchmarks at all... On x86 the speed was already abysmal, and it remained so.

And for the 64/32 variants there is not much creative we can do with the output functions, right?

pitdicker commented 6 years ago

Wow, the extension method for PCG is complex! And the C++ template stuff and lack of comments don't help either.

I tried implementing pcg32_k2_fast. This is the PCG-XSH-RR 64/32 variant, with an MCG as a base generator and an extension array of 1 word. From its description: "pcg32_k2_fast occupies the same space as pcg64, and can be called twice to generate 64 bits, but does not require 128-bit math; on 32-bit systems, it's faster than pcg64 as well."

That claim does not seem true though, as it is implemented with an extension array of 32 32-bit words. The file check-pcg32_k2_fast.out says it has a period of 2^2112, and a size of 264 bytes. But that also seems strange, because that assumes 64-bit words in the extension array, while the number of bits should be the same as the number of bits the RNG outputs, 32 in this case.

The extension mechanism comes with two choices: we can pick a size, and whether we want k-dimensional equidistribution (kdd). It is best if the size is a power of two, this makes the point when the extension table should be updated easier to recognise.

To generate a new random number, the output of the base generator (PCG-XSH-RR 64/32 in my case) is xored with a value picked from the extension array.

Which function to use to pick a value from the extension array depends on whether we want kdd. The PCG paper explains:

The selector function can be an arbitrary function, but for simplicity let us assume that k = 2^x and that selection merely drops bits to choose x bits from the state. Two obvious choices for choosing x bits are taking the high-order bits and taking the low-order bits. If we wish to perform party tricks like the ones we discussed in Section 4.3.4, it makes sense to use the low-order bits because they will access all the elements of the extension array before any elements repeat (due to the property of LCGs that the low-order l bits have period 2^l). Conversely, if we use the high-order bits, the extension array will be accessed in a more unpredictable pattern.
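
The selection step the quote describes might look like this (a minimal sketch with hypothetical names, not the actual pcg-random code; k = 4 slots assumed):

```rust
const K: usize = 4; // extension array size, k = 2^2

// Combine the base generator's output with one extension-array entry,
// selecting the slot from the low-order bits of the LCG state as the
// quote suggests (low bits of an LCG cycle through all k slots before
// any repeat, which is what enables k-dimensional equidistribution).
fn extended_output(base_out: u32, lcg_state: u64, ext: &[u32; K]) -> u32 {
    let idx = (lcg_state as usize) & (K - 1);
    base_out ^ ext[idx]
}

fn main() {
    // With an all-zero extension array the xor is a no-op,
    // so the extended generator degenerates to the base generator.
    assert_eq!(extended_output(0xDEADBEEF, 12345, &[0; K]), 0xDEADBEEF);
}
```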

Sometimes the values in the extension table need to be updated. PCG chooses for the following scheme:

Every value in the extension array is its own little PCG-RXS-M-XS RNG. The process to update a value is complex, slow, and in my opinion ugly. First the inverse of the RXS-M-XS output function is applied, which involves using a recursive un-xorshift function twice and multiplying by the modular inverse of the RXS-M-XS multiplier. Then the state recovered that way is advanced as if it were an LCG. Next the RXS-M-XS output function is applied to get the new value for the extension array. Repeat until all values are updated.

In about 50% (?) of the cases a value is also advanced a second time, to break the different array values out of lockstep. If the base generator is an MCG at least.

I did not finish my implementation of PCG with an extension array. It certainly does not fit in the 'small, fast RNG' category. I suppose the PCG EXT variants only look so fast in the benchmarks if the whole table update part does not happen.

Lokathor commented 6 years ago

Yeah, the extension for k-dimensional equidistribution is tricky. The Xoroshiro-style extension with cycling through array positions doesn't give k-dimensional but it is dead simple. Unfortunately, I'm pretty sure (I think?) that it doesn't give more period length very quickly. 2^n state slots in an array gives (again, i think) +n to your period (eg: 2^64 becomes 2^(64+n) instead).

I'm not sure what a good answer is, but I'll try to keep thinking about it in the next few days.

vks commented 6 years ago

@Lokathor I'm pretty sure the period is as large as it can be, since there is only one cycle for these RNGs.

Lokathor commented 6 years ago

False. Please read the PCG paper.

Ichoran commented 6 years ago

It really is worth reading the PCG paper. It's amazingly clear and approachable for what often seems like an arcane and difficult topic.

With regard to 2^n state slots, yes, you end up repeating yourself after 2^(m+n) steps, where 2^m is the period of the underlying generator; the proof is trivial: each slot is advanced once every 2^n steps, so the first slot returns to its starting value (after its 2^m advances) only after 2^(m+n) steps. Since each slot gets the same number of advances, that reasoning applies to every slot.
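
That argument can be checked directly with toy numbers (an illustrative sketch, not anything from the thread's generators): a full-period 4-bit LCG (period 2^4 = 16) cycled across 2^1 = 2 slots should first return to its starting configuration after exactly 2^(4+1) = 32 steps.

```rust
// Full-period LCG mod 16: x -> 5x + 1 (c odd, a ≡ 1 mod 4).
fn lcg_step(x: u8) -> u8 {
    x.wrapping_mul(5).wrapping_add(1) & 0xF
}

fn main() {
    let start_slots = [3u8, 7u8];
    let mut slots = start_slots;
    let mut idx = 0usize; // cycling slot selector
    let mut steps = 0u64;
    loop {
        slots[idx] = lcg_step(slots[idx]);
        idx = (idx + 1) % slots.len();
        steps += 1;
        if slots == start_slots && idx == 0 {
            break;
        }
    }
    // 2^(m+n) = 2^(4+1) = 32 steps, as the proof above predicts.
    assert_eq!(steps, 32);
}
```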

So you need a more complex scheme, some of which have self-similarity problems. So I'd agree with @pitdicker that it isn't really a small fast RNG any more. There aren't many instructions executed per clock cycle, but the logic for advancing is somewhat complex, and not completely trivial mathematically. (I'd still characterize it as straightforward, but there are plenty of opportunities for implementation errors.)

Anyway, is there a compelling reason not to just pick one of the non-extension schemes and have that be the default? Having fast yet decent-quality random numbers (even if on some architectures it's not as fast as others) seems like a sizable improvement over the status quo; and you can always leave in the existing implementations for people who have reason to prefer the old algorithm.

The nice thing about the PCG family is that not only are the algorithms close to as fast as they can be, there's also a theoretical framework that helps reassure us that it's unlikely that a really problematic non-random structure is lurking in there somewhere that just doesn't happen to be tested with the typical tests. This is a great reassurance for a standard library to have. (Note: we only have that reassurance for a single stream, not for comparison between multiple streams.)

Lokathor commented 6 years ago

Well, the only problem is that the default PCG you want kinda depends on the output you want. If you want mostly u32 values then 64/32 will probably serve you better than 128/64 simply because it's less space taken and it's somewhere between slightly faster to much faster depending on machine (it's not unreasonable to think that rust will regularly be run on 32 bit devices as well).

We could provide both and then explain why you'd want one or the other. Later on we might even be able to provide a pcg extras crate with macros that build a PCG type for you on the fly, complete with Rng impl and such. Fancy stuff can come later once we pick a good default.

dhardy commented 6 years ago

I think 32-bit x86 is pretty dead by now, but 32-bit ARM is still common, so there is some value. Note that the default generator does not need to be the same on all platforms; however, I don't think we can switch the algorithm depending on whether more u32 or u64 output is requested.

Lokathor commented 6 years ago

The generator stepping and permuting the LCG output are separate phases of the process. As long as the generator stepping is consistent for both modes, you can use a different permutation for u32 and u64 output.
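
A sketch of that separation (standard PCG LCG constants; the RXS M XS multiplier is the one from the reference implementation; the `Gen` type is hypothetical): one shared LCG step routine, with different output permutations for u32 and u64.

```rust
struct Gen {
    state: u64,
}

impl Gen {
    // The shared stepping phase: one LCG advance, returning the pre-step state.
    fn step(&mut self) -> u64 {
        let old = self.state;
        self.state = old
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        old
    }

    // XSH RR permutation for u32 output.
    fn next_u32(&mut self) -> u32 {
        let s = self.step();
        let xsh = (((s >> 18) ^ s) >> 27) as u32;
        xsh.rotate_right((s >> 59) as u32)
    }

    // RXS M XS permutation for full-width u64 output (same stepping).
    fn next_u64(&mut self) -> u64 {
        let mut s = self.step();
        let r = s >> 59;
        s ^= s >> (5 + r); // random xorshift
        s = s.wrapping_mul(12605985483714917081); // MCG multiply
        s ^ (s >> 43) // fixed xorshift
    }
}

fn main() {
    // Because the stepping is shared and deterministic, two generators
    // with the same seed stay in sync even when output widths are mixed.
    let mut a = Gen { state: 42 };
    let mut b = Gen { state: 42 };
    assert_eq!(a.next_u32(), b.next_u32());
    assert_eq!(a.next_u64(), b.next_u64());
}
```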

And yeah, I own a 32-bit ARM device that I use often enough, the Raspberry Pi board series. I'm sure that plenty enough other single-board computers are also 32-bit ARM devices and that people want to be able to use Rust on them.

pitdicker commented 6 years ago

For a quick summary: we are looking for an RNG that can generate u64s of good quality reasonably quickly on a 32-bit platform.

An RNG that outputs u32s is not great, because then producing one u64 means combining two outputs. This is more than twice as slow, and also reduces the period. One RNG that can generate u64s directly with good statistical quality is the 128-bit variant of PCG. Another is the 64-bit Xorshift/Xoroshiro with a widening multiply to 128 bits as an output function. Both are great on x86_64, but very slow on x86 because both need 128-bit multiplies, which are not available and need to be emulated.

What we are trying here is:

As the PCG paper notes, the problem with the RXS M XS variants is that every random number appears exactly once over a period of 2^64. It is then relatively quick to detect that the results are not truly random, because there are no duplicates. Of course a PRNG is never truly random, but it should appear so.

The question is: does the extension mechanism of PCG not only enlarge the period, but also fix this problem with RXS M XS? That is only true if the extension array is updated much more frequently than every time the period of the base RNG crosses over. Otherwise there will still not be any doubles during the very large period of the RNG. If the extension array is small it works, because the array gets updated frequently. But the PCG extension mechanism is not really made for small extension arrays like EXT2, and is very slow when it has to update frequently (at least from what I understand of it at the moment, note I edited https://github.com/dhardy/rand/issues/52#issuecomment-348708671).

I think the problem is simple: a requirement to get the proper number of doubles according to the generalized birthday problem is that at least 128 bits of state need to get updated frequently.

It seems to me a simple solution could be good enough: xor the output of PCG RXS M XS 64 with a 64-bit counter. And if we use a Weyl sequence instead of a counter, we can maybe even get away with using MCG as a base generator (and if we want we can still have streams). So just about no slowdown :smile:. I think this gives the proper distribution, but has the consequence that some results will not appear at all...

Something like this:

fn next_u64(&mut self) -> u64 {
    // MCG step
    self.m = self.m.wrapping_mul(MULTIPLIER);
    // Weyl sequence step (advances `w`, not `m`)
    self.w = self.w.wrapping_add(INCREMENT);
    let state = self.m ^ self.w;
    output_rxs_m_xs(state)
}

It will take a few days before I can test this though. It should also be possible to test the distribution of the results with a 32-bit variant, which would need only 4 GiB of memory.

Ichoran commented 6 years ago

I'm not sure the Weyl sequence adds anything beyond a simple incrementer, given the mixing afterwards (assuming we use the PCG mixers).

MCG is a bit risky given that it's degenerate when self.m is zero.
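
Concretely: with an odd multiplier, zero is the unique fixed point of the multiplicative step. An odd multiplier is invertible mod 2^64, so no nonzero state can ever reach zero, but a zero seed never leaves it:

```rust
fn main() {
    // Pure multiplicative (MCG) step with the usual odd multiplier.
    let mcg_step = |x: u64| x.wrapping_mul(6364136223846793005);
    assert_eq!(mcg_step(0), 0); // zero is absorbing: the generator is stuck
    assert_ne!(mcg_step(1), 0); // a nonzero state never maps to zero
}
```

So an MCG-based design has to reject (or adjust) all-zero seeds at construction time.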

Lokathor commented 6 years ago

The question is: does the extension mechanism of PCG not only enlarge the period, but also fix this problem with RXS M XS?

It should. It is my understanding that the 64/64 permutation has problems with each output appearing exactly once precisely because the period is too small. If you used it with a larger period, it would be fine. The reason that the 128/64 scheme works fine with 64-bit output is because the period is 2^128, not because the permutation is magically better on its own.

I think the problem is simple: a requirement to get the proper number of doubles according to the generalized birthday problem is that at least 128 bits of state need to get updated frequently.

This sums it up nicely.

I think this gives the proper distribution, but has the consequence that some results will not appear at all

That sounds like a very improper solution! I'd be upset to use a generator where some results can't possibly happen, even despite the fact that any particular result only has a 1/(2^64) chance to begin with.

Proposed Alternate Solution: We could wait a release cycle or two for 128-bit rust to become stable (assuming that it's out in the next cycle or two?), write the 128/64 PCG (which will have great 64-bit output), and then just accept that it will run very slowly on a 32-bit machine and tell people in the docs.

The reason that you'd use 128/64 is because you want to focus on 64-bit output, and if you're doing something that needs 64-bits at a time but running it on a 32-bit machine I hardly know what you're doing to begin with. That's just goofy. People don't normally think about the 32-bit/64-bit jump at all, but PRNGs are one of the things where it was a big deal and it continues to be a big deal and you do have to think about it. That's not something that we can fix ourselves because it's just part of how the math and hardware works out.

pitdicker commented 6 years ago

Found time to do some testing already. Code:

    fn next_u64(&mut self) -> u64 {
        // MCG
        self.m = self.m.wrapping_mul(6364136223846793005);
        // Weyl sequence
        self.w = self.w.wrapping_add(1442695040888963407);
        let mut state = self.m ^ self.w;

        // output function RXS M XS:
        // random xorshift, mcg multiply, fixed xorshift
        const BITS: u64 = 64;
        const OP_BITS: u64 = 5; // log2(BITS)
        const MASK: u64 = BITS - 1;

        let rshift = (state >> (BITS - OP_BITS)) & MASK;
        state ^= state >> (OP_BITS + rshift);
        state = state.wrapping_mul(6364136223846793005);
        state ^ (state >> ((2 * BITS + 2) / 3))
    }

    fn next_u32(&mut self) -> u32 {
        self.m = self.m.wrapping_mul(6364136223846793005);
        self.w = self.w.wrapping_add(1442695040888963407);
        let state = self.m ^ self.w;

        // output function XSH RR: xorshift high (bits), followed by a random rotate
        const IN_BITS: u32 = 64;
        const OUT_BITS: u32 = 32;
        const OP_BITS: u32 = 5; // log2(OUT_BITS)

        const ROTATE: u32 = IN_BITS - OP_BITS; // 59
        const XSHIFT: u32 = (OUT_BITS + OP_BITS) / 2; // 18
        const SPARE: u32 = IN_BITS - OUT_BITS - OP_BITS; // 27

        let xsh = (((state >> XSHIFT) ^ state) >> SPARE) as u32;
        xsh.rotate_right((state >> ROTATE) as u32)
    }

~~I was wrong when estimating the period, because the MCG is off by 1. This helps a lot: Period MCG: 2^62 - 1 Period Weyl sequence: 2^64 Combined period: least common multiple = 2^64 * (2^62 - 1) = 2^126~~ Edit: I really have to learn more about MCGs, choosing good multipliers, and how they relate to the period. The period of this combination is 2^64.

Performance is not bad, but not as good as I hoped: about 15~25% better than combining two outputs from PCG XSH RR 64/32. Benchmarks with Xorshift128/32 (the current RNG in rand) as a baseline:

x86_64:

test gen_u32_mwp                 ... bench:       1,472 ns/iter (+/- 8) = 2717 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_mwp                 ... bench:       1,363 ns/iter (+/- 6) = 5869 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_mwp                 ... bench:       3,311 ns/iter (+/- 13) = 1208 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_mwp                 ... bench:       5,101 ns/iter (+/- 27) = 1568 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

PractRand seems pretty happy with it until now (half a terabyte tested).

@Ichoran

MCG is a bit risky given that it's degenerate when self.m is zero.

Good point. We would have to make sure the seed is not 0, just like we have to for Xorshift/Xoroshiro and PCG with MCG as a base generator.
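The guard against a degenerate seed is cheap in all of these cases. A hedged sketch of both flavors mentioned above (the function names are hypothetical; the fallback constants for the xorshift case are arbitrary non-zero values, not anything from the post):

```rust
/// Map a possibly-zero seed to a valid MCG state. An MCG multiplies its
/// state each step, so a zero state stays zero forever; forcing the low
/// bit on also keeps the state odd, which an MCG with a power-of-two
/// modulus needs for maximal period.
fn sanitize_mcg_seed(seed: u64) -> u64 {
    seed | 1
}

/// For xorshift-family generators the *whole* state must be non-zero;
/// replace an all-zero seed with arbitrary fixed non-zero constants
/// (the constants here are an assumption for illustration).
fn sanitize_xorshift_seed(seed: [u64; 2]) -> [u64; 2] {
    if seed == [0, 0] {
        [0x9E3779B97F4A7C15, 0xD1B54A32D192ED03]
    } else {
        seed
    }
}

fn main() {
    assert_eq!(sanitize_mcg_seed(0), 1);
    assert_eq!(sanitize_mcg_seed(6), 7); // always odd
    assert_ne!(sanitize_xorshift_seed([0, 0]), [0, 0]);
    assert_eq!(sanitize_xorshift_seed([1, 2]), [1, 2]);
}
```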

It is my understanding that the 64/64 permutation has problems with each output appearing exactly once precisely because the period is too small. If you used it with a larger period, it would be fine.

Yes. Although it also depends a little on how the period works. For example imagine a scheme where the base generator first gives every number between 0 and 2^64 in one order, and for the next period every number again only once but in some other order. That is why I wanted to know the details of the extension mechanism.

@Lokathor

I think this gives the proper distribution, but has the consequence that some results will not appear at all

That sounds like a very improper solution! I'd be upset to use a generator where some results can't possibly happen, even despite the fact that any particular result only has a 1/(2^64) chance to begin with.

I agree it is not nice. On the other hand I don't think it matters. You only know which values are missing after generating and keeping track of 2^64 numbers. I don't think that is even possible. And for every seed the numbers that are double / triple, and the results that are missing are different. But it doesn't matter, the period is larger than I estimated.

Thanks to both of you for thinking along seriously!

I am not going to push this RNG too far, but it seems to work well and is faster than the other alternatives for generating good-quality u64s on x86.

pitdicker commented 6 years ago

Another possible solution: the XSH output function needs only 6 bits to do its work, and should be able to output up to 58 bits. The mantissa of an f64 can only store 53 bits. So we could make something like PCG XSH RR 64/53 work.
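As a rough illustration of what such an output step could look like, here is a hypothetical "XSH RR 64/53" derived with the same constant formulas as the 64/32 version above. This exact variant is a sketch, not part of the reference PCG family, and the rotate-within-53-bits handling is my own assumption:

```rust
/// Hypothetical XSH RR 64/53 output function: xorshift high, truncate
/// to 53 bits, then a data-dependent rotate within those 53 bits.
fn xsh_rr_64_53(state: u64) -> u64 {
    const IN_BITS: u32 = 64;
    const OUT_BITS: u32 = 53;
    const OP_BITS: u32 = 6; // ceil(log2(OUT_BITS))

    const ROTATE: u32 = IN_BITS - OP_BITS;           // 58
    const XSHIFT: u32 = (OUT_BITS + OP_BITS) / 2;    // 29
    const SPARE: u32 = IN_BITS - OUT_BITS - OP_BITS; // 5

    const MASK: u64 = (1u64 << OUT_BITS) - 1;

    let xsh = (((state >> XSHIFT) ^ state) >> SPARE) & MASK;
    // Rotate within a 53-bit word; u64::rotate_right only handles
    // power-of-two widths, so do it by hand.
    let r = (state >> ROTATE) as u32 % OUT_BITS;
    ((xsh >> r) | (xsh << (OUT_BITS - r))) & MASK
}

fn main() {
    let y = xsh_rr_64_53(0x0123_4567_89ab_cdef);
    println!("{}", y);
    assert!(y < (1u64 << 53)); // output always fits in 53 bits
}
```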

I see a few disadvantages though:

Lokathor commented 6 years ago

Well, converting one or more u64 values into a "good" f64 value (for some range and distribution and such) is rather external to any particular generator. Assuming a uniform production of u64 values, there should be a single formula that takes a generator and makes an f64 for any given distribution you want. Of course, some generators are ever so slightly non-uniform, and so they will make the final f64 ever so slightly non-uniform, but that's something the end user will have to care about (or not) when picking a generator. We can only document it and hope they read the documentation.
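For the uniform-on-[0, 1) case, one widely used formula (a standard technique, not anything specific to rand) keeps just the top 53 bits of the u64, since that is all an f64 mantissa can represent, and scales by 2^-53:

```rust
/// Map a uniform u64 to a uniform f64 in [0, 1) by keeping the top
/// 53 bits and multiplying by 2^-53. Every representable output is
/// an exact multiple of 2^-53, and 1.0 itself is never produced.
fn u64_to_f64(x: u64) -> f64 {
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}

fn main() {
    assert_eq!(u64_to_f64(0), 0.0);
    let v = u64_to_f64(u64::MAX);
    println!("{}", v);
    assert!(v >= 0.0 && v < 1.0); // largest output is 1 - 2^-53
}
```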