RoshanGerard / aparapi

Automatically exported from code.google.com/p/aparapi

How to control locking #13

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I am trying to implement a basic reduction algorithm, like "max".
Using the basic approach, each kernel instance handles two elements and the results are gradually combined. This approach requires that the kernel instances run in perfect lockstep, otherwise the combined result will be incorrect.

Sample implementation idea:
http://developer.apple.com/library/mac/#samplecode/OpenCL_Parallel_Reduction_Example/Listings/reduce_float_kernel_cl.html#//apple_ref/doc/uid/DTS40008188-reduce_float_kernel_cl-DontLinkElementID_7

Pseudo-code:

int id = getGlobalId();
for (int scan = 0; scan < scans; scan++) {
  int other = (1 << scan) + id;
  if (other < length)
    shared[id] = Math.max(shared[id], shared[other]);
}
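To make the intended lockstep semantics concrete, here is a plain-Java sequential sketch of the same scan (not Aparapi code; `shared`, `scans`, and `length` are taken from the pseudo-code above). Iterating the ids in ascending order per scan step reproduces the "all reads happen before all writes" behaviour the algorithm relies on, because `other` is always greater than `id` and so still holds its pre-step value:

```java
public class MaxReductionSketch {
    // Sequentially simulates the lockstep scan from the pseudo-code:
    // visiting ids in ascending order means shared[other] (other > id)
    // still holds its pre-step value, matching "all reads before writes".
    static int reduceMax(int[] input) {
        int length = input.length;
        int[] shared = input.clone();
        int scans = 32 - Integer.numberOfLeadingZeros(length - 1); // ceil(log2(length))
        for (int scan = 0; scan < scans; scan++) {
            for (int id = 0; id < length; id++) {
                int other = (1 << scan) + id;
                if (other < length)
                    shared[id] = Math.max(shared[id], shared[other]);
            }
        }
        return shared[0]; // after ceil(log2(length)) scans, slot 0 holds the max
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(reduceMax(data)); // prints 9
    }
}
```

On a real GPU the ascending-order guarantee does not hold across a whole global range, which is exactly the synchronization problem described above.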

Is it possible to explicitly emit lock/barrier instructions?

One solution is to use the "passes" mechanism to issue the "scan" number, but this seems fairly wasteful, as it re-executes the kernels rather than keeping them running.

In my case the reduction happens after a large number of operations, so I need to produce a separate kernel to do the reduction, as I cannot just run multiple passes on the entire kernel. Using the extra reducer kernel means that I need to replicate support code and copy large arrays between the two kernels.

Are there any thoughts on how this should be solved with Aparapi?

Original issue reported on code.google.com by kenneth@hexad.dk on 28 Oct 2011 at 12:04

GoogleCodeExporter commented 9 years ago
You have highlighted a problem that I think we need to solve in Aparapi to help with map/reduce-type workloads.

One trick (you may have tried this) is to use the pass id (from kernel.execute(range, passes)) to indicate that the final pass should perform the reduce step. This way the data is kept on the GPU and not shuffled back and forth.

So with a Kernel like

k = new Kernel(){
  @Override
  public void run() {
    if (getPassId() < 19) {
      // do the first 19 passes (0..18)
    } else {
      // do the final reduction pass (19)
    }
  }
};

k.execute(1024, 20);
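To see why relaunching per pass still yields a correct reduction, here is a plain-Java simulation (not Aparapi code): each outer iteration stands for one pass of kernel.execute(range, passes), and the implicit synchronization between passes replaces the missing lockstep within a single launch:

```java
public class MultiPassReduction {
    // Each outer iteration models one pass of kernel.execute(range, passes);
    // relaunching between passes provides the global synchronization that a
    // single launch cannot. Assumes data.length is a power of two.
    static int reduceMax(int[] data) {
        int n = data.length;
        int passes = Integer.numberOfTrailingZeros(n); // log2(n) passes
        for (int pass = 0; pass < passes; pass++) {
            int stride = n >> (pass + 1);
            // Only ids below the stride are "active" in this pass; the
            // rest of the range would simply fall through the branch.
            for (int id = 0; id < stride; id++) {
                data[id] = Math.max(data[id], data[id + stride]);
            }
        }
        return data[0];
    }

    public static void main(String[] args) {
        int[] data = {8, 3, 5, 13, 2, 11, 7, 1};
        System.out.println(reduceMax(data)); // prints 13
    }
}
```

The cost is one kernel dispatch per reduction step, which is why keeping the buffers on the GPU between passes matters.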

The real solution (and I am looking at this now) is some way to allow a Kernel to have multiple entry points. I have yet to come up with a clean syntax for this, so I would welcome suggestions.

Original comment by frost.g...@gmail.com on 28 Oct 2011 at 3:22

GoogleCodeExporter commented 9 years ago

Original comment by frost.g...@gmail.com on 28 Oct 2011 at 3:23

GoogleCodeExporter commented 9 years ago
I thought of that, but would that not cause every core to execute both branches on each pass (discarding the results of the untaken branch)? Or is there a special case when all cores take the same branch?

Original comment by kenneth@hexad.dk on 7 Nov 2011 at 12:09

GoogleCodeExporter commented 9 years ago
Sorry Kenneth, for some reason I missed your question.

Because we use global memory (as far as the GPU is concerned), we must wait for all kernel instances to complete. Conceptually they are all running at the same time, and we can never expect one kernel instance to see the result of another, even if we know the group order. So I think this will always require us to relaunch the kernel.

Relaunching is not *that* expensive, especially if we can avoid moving buffers by calling setExplicit(true) and taking control of buffer transfers ourselves.

I just added a 'life' demo (Conway's Game of Life) which executes a Kernel inside a fairly tight loop and only pulls the buffer from the GPU when Swing wishes to display the data. This might help in your case as well.

Unless I have completely missed the point ;) 

Gary

Original comment by frost.g...@gmail.com on 10 Nov 2011 at 7:30

GoogleCodeExporter commented 9 years ago
I discovered the localBarrier() function, which seems to do what I was looking for (synchronization within a work group).
I did try the other suggestion (multi-pass), and it seemed to carry a heavy performance penalty, but I abandoned the idea, so I am not sure what caused it.

Feel free to close this issue.
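For reference, localBarrier() synchronizes the work-items of a single work group, much like a barrier in OpenCL. A plain-Java analogue (for illustration only; not Aparapi code) uses one thread per "work-item" with java.util.concurrent.CyclicBarrier standing in for localBarrier(), and the stride-halving formulation of the reduction, which needs only one barrier per step:

```java
import java.util.concurrent.CyclicBarrier;

public class BarrierReduction {
    // Plain-Java analogue of a work-group max reduction: one thread per
    // "work-item", CyclicBarrier standing in for localBarrier().
    // With stride-halving, writers (id < stride) never touch the slots
    // that readers (id + stride) use in the same step, so one barrier
    // per step suffices. Assumes shared.length is a power of two.
    static int reduceMax(int[] shared) {
        int n = shared.length;
        CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    for (int stride = n / 2; stride >= 1; stride /= 2) {
                        if (id < stride) {
                            shared[id] = Math.max(shared[id], shared[id + stride]);
                        }
                        barrier.await(); // every work-item reaches the barrier
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return shared[0];
    }

    public static void main(String[] args) {
        System.out.println(reduceMax(new int[]{6, 2, 9, 4, 1, 8, 3, 7})); // prints 9
    }
}
```

Note that a barrier only synchronizes within one work group; reducing across groups still requires a relaunch or a second kernel, as discussed above.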

Original comment by kenneth@hexad.dk on 27 Dec 2011 at 9:56

GoogleCodeExporter commented 9 years ago

Original comment by ryan.lam...@gmail.com on 20 Apr 2013 at 12:30