chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Add some sort of mechanism to have a non 1:1 mapping between loop iterations and GPU threads #22175

Open stonea opened 1 year ago

stonea commented 1 year ago

So far, given a GPUizable loop like:

foreach i in 0..<N do ...

when this is run there will be a one-to-one mapping between loop iterations and GPU threads.

There may be various reasons for users not to want a one-to-one mapping; for example, the iteration space may exceed the maximum allowed number of threads. Also see: https://github.com/chapel-lang/chapel/issues/22152#issuecomment-1525828257

To change the mapping, users could rewrite their loops to use an inner for, say something like:

foreach i_prime in 0..<N by 2 do
  for i in i_prime..<min(i_prime+2, N) do ...

But maybe it would be nicer if we had some language feature to do that. For example:

foreach i in 0..<N with (config cfg = new LoopContext(threadSize=2)) do ...
psath commented 1 year ago

This relates to #22176 and #22177 when trying to implement the common GPU pattern of an intra-thread loop with a by clause of gridDim*blockDim.

for (int x = threadIdx.x + blockDim.x*blockIdx.x; x < foo; x+= gridDim.x*blockDim.x) {
  //Do something with contiguous indices in the same warp
}

This pattern is important for keeping the threads in a warp working on contiguous indices of the iteration space, for performance.

vasslitvinov commented 3 months ago

While #25855 adds an initial implementation of this feature, I am keeping this issue open because the comment https://github.com/chapel-lang/chapel/pull/25855#issuecomment-2327527618 requests a variant in which the loop iterations executed by a given thread are selected cyclically (round-robin) rather than sequentially, as in #25855. We also need to implement this feature for reduction kernels.

e-kayrakli commented 1 month ago

Thanks for working on this, @vasslitvinov !

vasslitvinov commented 1 month ago

Oops, I did not mean to close it, as we still need to implement it for reduce kernels. Reopening.

e-kayrakli commented 1 month ago

Tagging @stonea, as we have been discussing var intent support for GPU kernels. The reason is that, to support reductions alongside itersPerThread, we need a way to create per-thread variables in the generated GPU kernel, and I think that requirement is common to both efforts.

I don't think that is a big effort, but I wanted both of you to be aware of the related work to avoid stepping on toes / repeating effort, etc. As of today, neither effort is scheduled for the near term, but it would be good to get both done in a release or two.