bashtage / ng-numpy-randomstate

Numpy-compatible random number generator that supports multiple core pseudo-RNGs and explicitly parallel generation.

Jump is too slow for larger jump steps! #123

Closed jyzhang-bjtu closed 6 years ago

jyzhang-bjtu commented 6 years ago

The time consumption of the jump(i) function seems to depend on the step count i.

Here is the sample code:

import timeit

import randomstate.prng.xoroshiro128plus as rnds

def func(N, offset):
    # Build N generators with the same seed, then jump each a different
    # number of steps to obtain N distinct streams.
    rs1 = [rnds.RandomState(0) for _ in range(N)]
    for i in range(N):
        rs1[i].jump(i + offset)

t0 = timeit.timeit('func(16, 0)', setup="from __main__ import func", number=1)
t1 = timeit.timeit('func(16, 4000000)', setup="from __main__ import func", number=1)

t0 is about 0.006 s, while t1 is about 1.06 s, more than 100 times slower than t0. Why?

bashtage commented 6 years ago

Calling jump(n) internally runs a loop that performs a single unit jump n times.
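
For intuition, a minimal sketch of that behavior using the library's own single-step jump (equivalent in effect, and in cost, to jump(n)):

import randomstate.prng.xoroshiro128plus as rnds

n = 1000
rs = rnds.RandomState(0)
for _ in range(n):
    rs.jump()  # each call applies the precomputed jump polynomial once

# ...does the same work as the single call below, so both cost O(n):
rs2 = rnds.RandomState(0)
rs2.jump(n)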

I don't understand what you would want to accomplish by using a large jump.

jyzhang-bjtu commented 6 years ago

I just want a large number of independent random streams, up to 4000*8192 of them. These random streams are used in deep learning, for repeatability between training epochs. So I use RandomStates with the same seed and jump a different number of steps to obtain each stream. I suggest it would be better to have an initialization parameter offset, as cuRAND does; the pseudocode is as follows:

rs1 = [rnds.RandomState(0, offset=i) for i in range(N)]

bashtage commented 6 years ago

That won't make it any faster. The jump uses a precomputed polynomial coefficient and so has a fixed size -- for example, 2^512 steps in the case of xorshift1024. So if you want to advance 10 jumps, the polynomial is applied to the state 10 times.

You might consider the PCG64 generator. It has actual independent streams and can be initialized like

rs1 = [randomstate.prng.pcg64.RandomState(seed, i) for i in range(N)]

TBH I don't really understand your strategy or see why you need this much independence, since the number of randoms produced by each stream is likely pretty small.

From the other thread

For example, I want to generate an epoch of size 1e5. The total data size is, for example, (1e5, 256), where 256 is the sample data length. Then I want to shuffle the data set and iteratively learn on it. However, if I use rand(1e5, 256), then I cannot shuffle the data and work on the same dataset.

It sounds like you could use one RandomState to produce the randoms (generating 1e5 * 256 values is very quick) and then a second RandomState to do the shuffle. This would allow full reproducibility, I think.

Something like

import randomstate

number_gen = randomstate.prng.xorshift1024.RandomState(seed)
number_gen.jump()
initial_state = number_gen.get_state()  # In case you want to return later
shuffler = randomstate.prng.xorshift1024.RandomState(seed)
rvs = number_gen.rand(100000, 256)  # rand takes ints, not 1e5
shuffler.shuffle(...)  # shuffle the dataset using the second stream
number_gen.set_state(initial_state)  # Back to the original values, if needed

Although it is possible I don't fully understand your problem.

jyzhang-bjtu commented 6 years ago

Let A0 = (nEpoch, nSample) be the random array I want to generate. Each A0[i] should be an independent random sequence and should be repeatable during learning. Furthermore, A0[i] should be shuffled during training. Your suggestion is OK for a small nEpoch, such as 8192 and below. However, the whole array A0 would have to be cached or written to a disk file for future shuffling. If nEpoch is large enough, such as 4000*8192, it is impossible to write it to disk or cache it in memory. So I need to generate A0 batch by batch, where the batch size can be (8192, 256) or smaller. Then I require a large number of random streams (4000*8192) and an independent RNG to shuffle those streams for repeatability.

jyzhang-bjtu commented 6 years ago

Thanks for your suggestion. I will use PCG-64.

bashtage commented 6 years ago

Another option you could consider is precomputing the states and saving them (don't pickle, just save the core state). The state of xoroshiro128 is two 64-bit numbers, so even if you have 32M copies of the PRNG, the entire set of states can be stored in 16 * 32M bytes (512 MB), and this could be memmapped when consuming. Saving the core state vector takes much less space than saving the generated random numbers. It could be restored using something like

# states is a 1d array of uint64 with 2 * 4000 * 8192 elements
idx = 1342345  # prng index

import randomstate

gen = randomstate.prng.xoroshiro128plus.RandomState()
curr_state = gen.get_state()
curr_state['state'] = tuple(states[2 * idx:2 * idx + 2])  # two uint64 words
gen.set_state(curr_state)
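
For the one-time precompute-and-save step, a minimal sketch (assuming, as above, that the 'state' entry holds the two uint64 words; 'states.bin' is a hypothetical file name):

import numpy as np
import randomstate

N = 4000 * 8192  # number of streams

# Jump a single PRNG forward one unit at a time, recording each state.
gen = randomstate.prng.xoroshiro128plus.RandomState(0)
table = np.empty(2 * N, dtype=np.uint64)
for i in range(N):
    table[2 * i:2 * i + 2] = gen.get_state()['state']
    gen.jump()
table.tofile('states.bin')

# When consuming, memmap the 512 MB table instead of loading it all.
states = np.memmap('states.bin', dtype=np.uint64, mode='r')
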
jyzhang-bjtu commented 6 years ago

Thanks, storing the state is another solution. Combined with h5py, xoroshiro1024plus could also be used. By the way, PCG64 is fast enough.

bashtage commented 6 years ago

I think I finally understand your problem. You need to be able to effectively jump around a random array with ~4000*8192*256 elements, although you are happy to do this in blocks of 256. For this problem, I can think of two solutions.

  1. Use a PRNG that supports real streams -- PCG64. Ideally, there would be more support for proper streaming PRNGs, since these are really the best way to generate random numbers in high-dimensional spaces.
  2. Use an array of PRNGs that support jump. Here you need to make some trade-offs. For example, suppose you use 2^13 PRNGs and you uniformly slice your space into 2^13 disjoint segments. Suppose you need to get block M (in 0, 1, ..., 8192*4000). Then you could compute M // 2^13 to get the PRNG number, and you would need to generate (M - offset) * 256 randoms, where offset = 2^13 * (M // 2^13), and only use the last 256 observations (a sketch follows this list). This is computationally wasteful but relatively memory efficient. The initial method you were using had M PRNGs, so that offset = 0 in all cases.
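
A rough sketch of option 2, under the slicing just described (the names and the segment size are illustrative, not a definitive implementation):

import randomstate.prng.xoroshiro128plus as rnds

BLOCK = 256    # values per block
SEG = 2 ** 13  # blocks covered by each pooled PRNG

def make_pool(seed, n_prngs):
    # One saved state per segment: the same seeded PRNG, jumped one
    # extra unit for each successive pool member.
    pool = []
    rs = rnds.RandomState(seed)
    for j in range(n_prngs):
        pool.append(rs.get_state())
        rs.jump()  # one unit jump separates consecutive segments
    return pool

def get_block(pool, M):
    # Restore the PRNG for segment M // SEG, burn through the blocks
    # before M inside the segment (hence the waste), keep the last one.
    rs = rnds.RandomState(0)
    rs.set_state(pool[M // SEG])
    offset = SEG * (M // SEG)
    return rs.random_sample((M - offset + 1) * BLOCK)[-BLOCK:]
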
jyzhang-bjtu commented 6 years ago

Yes, I really need to be able to effectively jump around a random array with ~4000*8192*256 elements in a certain block length. The block length, which here equals 256, is just an example; in my application it varies from 256 to 8096 or higher. The random numbers may be ints or floats, and the distributions include uniform and normal. So I need to jump between blocks as quickly as possible, because the time spent generating randoms should be as small as possible. In conclusion:

  1. PCG64 is preferred because it is fast. The time consumption for jumping 8096 blocks of length 256 is about 0.5 s (a PCG64 sketch follows this list).
  2. If an array of PRNGs, such as MT19937 or xorshift1024, is used, a CUDA-accelerated RNG should be used. PyCUDA may be a solution; however, PyCUDA only provides the uniform and Gaussian distributions.
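
A possible sketch of point 1, assuming randomstate's PCG64 exposes advance() and that each double consumes one 64-bit draw (the seed values are illustrative):

import randomstate

BLOCK = 256
SEED, STREAM = 12345, 0

def block(M):
    # Re-seed at the stream origin and advance directly to block M;
    # advance() is a cheap counter jump, unlike a looped jump().
    rs = randomstate.prng.pcg64.RandomState(SEED, STREAM)
    rs.advance(M * BLOCK)
    return rs.random_sample(BLOCK)
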
bashtage commented 6 years ago

Numba supplies xoroshiro128+ for the GPU. xorshift1024 should be programmable for the GPU using Numba.
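
For reference, a minimal sketch of Numba's built-in GPU generator from numba.cuda.random (the sizes are illustrative):

from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)
import numpy as np

@cuda.jit
def fill_uniform(rng_states, out):
    # One independent subsequence per thread/stream.
    tid = cuda.grid(1)
    if tid < out.shape[0]:
        for j in range(out.shape[1]):
            out[tid, j] = xoroshiro128p_uniform_float32(rng_states, tid)

n_streams, block_len = 8192, 256
rng_states = create_xoroshiro128p_states(n_streams, seed=0)
out = np.zeros((n_streams, block_len), dtype=np.float32)
fill_uniform.forall(n_streams)(rng_states, out)

numba.cuda.random also provides normal variants (e.g. xoroshiro128p_normal_float32), which covers the Gaussian case mentioned above.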
