Closed GoogleCodeExporter closed 9 years ago
You have highlighted a problem that I think we need to solve in Aparapi to
help with map-reduce-type problems.
One trick (you may have tried this) is to use the pass id (from
kernel.execute(range, passes)) to indicate that the final pass should perform
the reduce step. This way the data is kept on the GPU and not shuffled
back and forth between passes.
So with a Kernel like
    Kernel k = new Kernel(){
       @Override public void run(){
          if (getPassId() < 19){
             // do first 19 (0..18) passes
          }else{
             // do final reduction pass (19)
          }
       }
    };
    // 1024 work items, 20 passes (pass ids 0..19)
    k.execute(1024, 20);
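To make the pass structure concrete, here is a minimal plain-Java sketch that mimics it on the CPU. It makes no Aparapi calls: the class MultiPassSketch, its methods, and the "add one per pass" work are all hypothetical stand-ins, and the single-work-item sum in the last pass is deliberately naive. It only shows that every pass finishes before the next begins, so the final pass can safely read what earlier passes wrote.

```java
// Hypothetical CPU model of a multi-pass launch; not Aparapi code.
final class MultiPassSketch {

    // Stand-in for Kernel.run(): called once per (passId, globalId) pair.
    static void run(int passId, int passes, int globalId, int[] data, int[] result) {
        if (passId < passes - 1) {
            // "map" passes: placeholder work, each work item bumps its own slot
            data[globalId] += 1;
        } else if (globalId == 0) {
            // final pass: a naive reduction done by a single work item
            int sum = 0;
            for (int i = 0; i < data.length; i++) {
                sum += data[i];
            }
            result[0] = sum;
        }
    }

    // Stand-in for k.execute(size, passes): passes run strictly one after another.
    static int execute(int size, int passes) {
        int[] data = new int[size];
        int[] result = new int[1];
        for (int pass = 0; pass < passes; pass++) {
            for (int id = 0; id < size; id++) {
                run(pass, passes, id, data, result);
            }
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(execute(1024, 20)); // 19 map passes over 1024 items
    }
}
```

Within any one pass every work item sees the same pass id, so all of them take the same side of the if.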
The real solution (and I am looking at this now) is some way to allow a Kernel
to have multiple entrypoints. I have yet to come up with a clean syntax for this,
so I would welcome suggestions.
Original comment by frost.g...@gmail.com
on 28 Oct 2011 at 3:22
Original comment by frost.g...@gmail.com
on 28 Oct 2011 at 3:23
I thought of that, but would that not cause every core to execute both branches
for each pass (without storing the results)? Or is there a special case that
happens when all cores take the same branch?
Original comment by kenneth@hexad.dk
on 7 Nov 2011 at 12:09
Sorry Kenneth, for some reason I missed your question.
Because we use global memory (as far as the GPU is concerned), we must wait for all
Kernels to complete. Conceptually they are all running at the same time, and we
can never expect one Kernel to see the result of another, even if we know the
group order. So I think this will always require us to relaunch the kernel.
Relaunch is not *that* expensive, especially if we can avoid moving buffers by
calling setExplicit(true) and taking control of buffer transfers ourselves.
I just added a 'life' demo (Conway's Game of Life) which executes a Kernel
inside a fairly tight loop and only pulls the buffer from the GPU when
Swing wishes to display the data. This might help in your case as well.
Unless I have completely missed the point ;)
Gary
Original comment by frost.g...@gmail.com
on 10 Nov 2011 at 7:30
I discovered the "localBarrier()" function that seems to do what I was looking
for (synchronize).
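For anyone landing here later, this is roughly the pattern a barrier enables, sketched in plain Java rather than as a real kernel. The class GroupReduceSketch and its methods are hypothetical, the group size is assumed to be a power of two, and the comment marks where a localBarrier() call would sit in actual kernel code. A local barrier only synchronizes work items within one work-group, so the sketch yields one partial sum per group that still has to be combined afterwards.

```java
import java.util.Arrays;

// Hypothetical CPU model of a per-work-group tree reduction; not Aparapi code.
final class GroupReduceSketch {

    // Sums one group's values; values.length is assumed to be a power of two.
    static int reduceGroup(int[] values) {
        int[] scratch = values.clone(); // stands in for the group's local buffer
        for (int stride = scratch.length / 2; stride > 0; stride /= 2) {
            // Each iteration models one work item whose local id is below stride.
            for (int localId = 0; localId < stride; localId++) {
                scratch[localId] += scratch[localId + stride];
            }
            // In a real kernel, localBarrier() would go here so that every
            // work item in the group finishes this step before the next one.
        }
        return scratch[0];
    }

    // Produces one partial sum per group; data.length must be a multiple of groupSize.
    static int[] reduceAll(int[] data, int groupSize) {
        int groups = data.length / groupSize;
        int[] partial = new int[groups];
        for (int g = 0; g < groups; g++) {
            int from = g * groupSize;
            partial[g] = reduceGroup(Arrays.copyOfRange(data, from, from + groupSize));
        }
        return partial;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(Arrays.toString(reduceAll(data, 4)));
    }
}
```

The partial sums can then be added on the host or in one further pass.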
I did try the other suggestion (multi-pass), and it seemed to carry a heavy
performance penalty, but I abandoned the idea, so I am not sure what caused it.
Feel free to close this issue.
Original comment by kenneth@hexad.dk
on 27 Dec 2011 at 9:56
Original comment by ryan.lam...@gmail.com
on 20 Apr 2013 at 12:30
Original issue reported on code.google.com by
kenneth@hexad.dk
on 28 Oct 2011 at 12:04