e-kayrakli opened 1 year ago
You can understand the issue if you look at PCGRandomPrivate_iterate. That has a for loop (rather than a foreach) in the follower loop that yields. The reason for this is that the cursor is storing the RNG state and being updated in each call to randlc. Ideally, we would have this working in a vectorizable way, but it's hard for me to see how to do that without foreach intents / per-vector-lane variables. And even then, I'm not sure how to write it efficiently.
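To make the loop-carried dependence concrete, here is a minimal sketch (a hypothetical lcgStream iterator with illustrative LCG constants, not the actual PCGRandom code): each yielded value needs the cursor left behind by the previous iteration, which is why the yielding loop has to stay a serial for.

```chapel
// Hypothetical iterator, for illustration only -- not the Random module code.
// The cursor update creates a dependence from one iteration to the next,
// so this loop cannot simply become a `foreach`.
iter lcgStream(n: int, seed: uint(64)): uint(64) {
  var cursor = seed;                       // RNG state carried across iterations
  for 1..n {
    cursor = cursor * 6364136223846793005:uint(64)
                    + 1442695040888963407:uint(64);
    yield cursor;
  }
}

// e.g.: for r in lcgStream(5, 42) do writeln(r);
```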
Do you think that we could have a GPU forall using this iterator just serialize the computation from the iterator's for loop? Just because the for loop computing the random numbers is serial doesn't mean that we can't make a kernel out of the forall / foreach running it; provided we have a way to do the synchronization between GPU threads to make the iterator's computation serial.
Do you think that we could have a GPU forall using this iterator just serialize the computation from the iterator's for loop? Just because the for loop computing the random numbers is serial doesn't mean that we can't make a kernel out of the forall / foreach running it; provided we have a way to do the synchronization between GPU threads to make the iterator's computation serial.
Probably we could with a considerable amount of effort, but that feels more academic than production-oriented. If I am not mistaken, we would have to make sure that all GPU threads go one-by-one in the same deterministic order picking their random numbers from the iterator. Wouldn't that be terribly inefficient?
I think the first step here is to understand how people generate random numbers on GPUs (which you may have the answer to :) ) and write a GPU-oriented RNG.
Note that the urgency of the issue is significantly reduced if you write your application in a way where the random data is generated on the host and then copied to the device, which may just be the common case in the GPU programming world.
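For what it's worth, that workaround would look roughly like the sketch below (assuming a GPU-enabled build where here.gpus has at least one sublocale; the array size and the consuming loop body are placeholders):

```chapel
use Random;

config const n = 1_000_000;

var hostData: [0..<n] real;
fillRandom(hostData);              // RNG runs on the host

on here.gpus[0] {
  var devData = hostData;          // device-resident copy of the host data
  foreach i in devData.domain do   // GPU-eligible loop consuming the data
    devData[i] = 2*devData[i] - 1;
}
```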
If I am not mistaken, we would have to make sure that all GPU threads go one-by-one in the same deterministic order picking their random numbers from the iterator. Wouldn't that be terribly inefficient?
The key point is that only the random number generation portion is serialized. The rest of the user's loop (which probably does something less trivial) is not. The core of the RNG is a couple of instructions.
I.e., say we have K GPU threads running simultaneously on the device:
1. GPU thread 0 generates the next K random values
2. barrier the K threads
3. each thread copies its random value from the generated ones
4. each thread proceeds with whatever the user wrote in their forall loop body, working in parallel
I don't know how inefficient this is, but I'd imagine it wouldn't be too bad, because the serial part is normally short (compared to the user's forall loop body) and I understand the GPU thread barrier to be pretty fast. Note that you could probably use the thread divergence stuff to handle the barrier without actually even writing a barrier.
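Roughly, the shape would be something like the host-side analogue below (purely illustrative: it uses the same hypothetical LCG step as above and ordinary Chapel loops rather than real GPU thread/barrier primitives):

```chapel
config const K = 256;

var blockVals: [0..<K] uint(64);
var cursor: uint(64) = 0x2545F4914F6CDD1D;

// "thread 0": generate the next K values serially (the only serialized part)
for i in 0..<K {
  cursor = cursor * 6364136223846793005:uint(64)
                  + 1442695040888963407:uint(64);
  blockVals[i] = cursor;
}

// (on the GPU, the barrier between the K threads would go here)

// all K "threads": proceed with the user's loop body in parallel
forall i in 0..<K do
  doWork(blockVals[i]);

// placeholder for whatever the user's forall body does
proc doWork(r: uint(64)) { /* ... */ }
```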
I think the first step here is to understand how people generate random numbers on GPUs (which you may have the answer to :) ) and write a GPU-oriented RNG.
I don't know the answer to this. It'd be my hope that we can adjust the PCG RNG to be GPU-friendly rather than creating a wholly different RNG.
barrier the K threads
You can't sync all GPU threads while executing the kernel unless you use... "cooperative threads" or something that CUDA rolled out in a relatively recent version. Is that not right? But I am guessing what you're referring to as "barrier without a barrier" is by using some sort of atomic operation, which sounds more doable.
I don't know how inefficient this is, but I'd imagine it wouldn't be too bad, because the serial part is normally short (compared to the user's forall loop body)
Yeah, it is promising. I was expecting the RNG to be more time-consuming. It'll probably be bound by how fast a single thread can write to global memory sequentially.
As a further question on our RNGs: skipping N random numbers is the same as just generating those N in terms of performance, isn't it? Otherwise, maybe a thread per block can generate random numbers for all its peers as a potential (premature?) optimization on your suggestion.
As a further question on our RNGs: skipping N random numbers is the same as just generating those N in terms of performance, isn't it?
No, it's not, but generating the N random numbers in sequence is quite a bit faster than generating them independently by "seeking". I don't remember exactly, but I think the "seeking" operation is O(log N).
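For context, the O(log N) seek comes from composing the affine LCG step with itself via repeated squaring, along the lines of the sketch below (same illustrative constants as above; this is not the Random module's actual advance routine):

```chapel
// Advance an LCG-style state by `n` steps in O(log n) multiplications.
// Constants are illustrative; the real PCG generator uses its own.
proc lcgSkip(state: uint(64), n: uint(64)): uint(64) {
  var accMult: uint(64) = 1, accAdd: uint(64) = 0;
  var curMult = 6364136223846793005:uint(64),
      curAdd  = 1442695040888963407:uint(64);
  var steps = n;
  while steps > 0 {
    if (steps & 1) == 1 {
      accAdd  = accAdd * curMult + curAdd;
      accMult = accMult * curMult;
    }
    curAdd  = (curMult + 1) * curAdd;   // compose the step with itself
    curMult = curMult * curMult;
    steps >>= 1;
  }
  return accMult * state + accAdd;
}
```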
Interesting, but I guess seeking in a stream alters its state so it is not parallel-safe.
We can copy the RNG state and seek independently in each copy.
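Reusing the hypothetical lcgSkip from the sketch above, that would look roughly like this: each task works from its own private copy of the state, seeked to a disjoint offset, so no shared RNG state is ever mutated.

```chapel
config const nTasks = 4, perTask = 1000;
const seed: uint(64) = 12345;

forall t in 0..<nTasks {
  // private cursor, seeked t*perTask steps into the stream
  var cursor = lcgSkip(seed, (t*perTask):uint(64));
  for 1..perTask {
    cursor = cursor * 6364136223846793005:uint(64)
                    + 1442695040888963407:uint(64);
    // ... consume cursor as this task's next random value ...
  }
}
```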
I added "user issue" to this to denote that this is required by some of the Monte Carlo simulations we are developing with Hui. The general need there is not about fillRandom, but more about random streams that execute on the device. I believe they are one and the same, but interface-wise, what's required may be slightly different.
For reference https://github.com/e-kayrakli/mc321.chpl uses CUDA's RNG for a related Monte Carlo simulation.
Our GPU-based stream implementations populate vectors "manually" to avoid calling fillRandom as the CPU version does. I believe the main motivation for that is now moot and we can call fillRandom safely. However, things are more complicated for array_on_device mode.

fillRandom does a forall over the array and an internal RNG iterator to populate the array. That forall is currently not GPU-eligible, so it executes on the CPU. With CHPL_GPU_MEM_STRATEGY that's OK, but it may not be the case with CHPL_GPU_MEM_STRATEGY=array_on_device. I believe especially after https://github.com/chapel-lang/chapel/pull/22349 things may work by virtue of the CPU doing a PUT to GPU memory, which will then be cuMemcpy'ed under the hood. That, at the very least, will have performance implications even if it works.

As of today, I am choosing to keep our GPU streams using manual data until we spend some time understanding what fillRandom does, and whether we can make its implementation GPU-eligible or need to implement a more efficient RNG for GPU-based data. I am tagging this as "Unimplemented Feature" though it is more of a "feature that needs more attention before we say it can be used".
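The scenario in question is, concretely, something like the sketch below (assuming a GPU-enabled build; whether this behaves correctly and efficiently under array_on_device is exactly what's unresolved):

```chapel
use Random;

config const n = 1_000_000;

on here.gpus[0] {
  var A: [0..<n] real;   // device-resident under CHPL_GPU_MEM_STRATEGY=array_on_device
  fillRandom(A);         // today the underlying forall runs on the CPU,
                         // writing into GPU memory from the host
}
```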
TODO:
- stream.chpl (GPU test) to use fillRandom
- multiGpuStream.chpl (GPU test) to use fillRandom