Automatically exported from code.google.com/p/aparapi

JTP orders of magnitude slower than SEQ (even with substantial work sizes) #61

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Execute a kernel in JTP mode (see attached test code).

What is the expected output? What do you see instead?
Expected: the execution time of JTP is similar to or less than SEQ.
Instead: the execution time of JTP is orders of magnitude higher than SEQ for work
sizes up to about 32000.

What version of the product are you using? On what operating system?
aparapi-2012-05-06.zip (R#407, May 6).
Ubuntu 12.04 x64
NVIDIA GT 540M, driver version 295.40, CUDA toolkit 4.2.9
Intel Core i7-2630QM (2GHz, quad-core (pretend 8 with hyper-threading...))

Please provide any additional information below.
I tested a kernel that applies several functions to the work-item ID
(trigonometric, cube root, exponential), with the results added or multiplied
together. The kernel was tested over work sizes ranging from 2 to
1048576, with 1024 iterations at each size.

In a subsequent test I tried executing the kernel in JTP mode with a group size
of 4, to match the number of CPU cores (rather than letting Aparapi choose the
group size). The results were much improved for work sizes up to about 262000 (but
slightly worse for work sizes larger than this); see the second set of results
below. So perhaps this is simply a matter of working out how to choose a good
group size (number of threads?) in JTP mode.
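
For context, a minimal sketch of how the two configurations below were driven.
The kernel body is only a stand-in for the arithmetic described above;
Range.create, Kernel.EXECUTION_MODE.JTP and kernel.execute(Range) are the
standard Aparapi entry points of that era.

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class JtpGroupSizeTest {
    public static void main(String[] args) {
        final int globalSize = 32768;
        final float[] result = new float[globalSize];

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                // stand-in for the trig/cube-root/exponential arithmetic described above
                result[i] = (float) (Math.sin(i) + Math.cos(i) * Math.exp(i % 8));
            }
        };
        kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);

        // Configuration 1: let Aparapi choose the group size
        kernel.execute(Range.create(globalSize));

        // Configuration 2: force a group size of 4 to match the number of CPU cores
        kernel.execute(Range.create(globalSize, 4));

        kernel.dispose();
    }
}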

Results, letting Aparapi choose group size in JTP mode:
Size      SEQ        JTP        GPU
2         0.009s     0.148s     0.335s
4         0.015s     0.289s     0.135s
8         0.005s     0.361s     0.144s
16        0.01s      0.628s     0.123s
32        0.015s     1.193s     0.118s
64        0.028s     2.792s     0.117s
128       0.054s     6.153s     0.108s
256       0.112s     14.786s    0.12s
512       0.211s     15.251s    0.111s
1024      0.402s     15.263s    0.124s
2048      0.754s     15.662s    0.151s
4096      1.467s     15.655s    0.167s
8192      2.844s     15.806s    0.256s
16384     5.747s     15.932s    0.399s
32768     11.366s    16.49s     0.701s
65536     22.775s    17.414s    1.313s
131072    45.818s    21.927s    2.538s
262144    91.924s    32.749s    4.974s
524288    183.459s   56.879s    9.852s
1048576   369.247s   102.847s   19.615s

Results when specifying a group size matching the number of CPU cores in JTP 
mode:
Size      SEQ        JTP        GPU
2         0.008s     0.266s     0.325s
4         0.003s     0.218s     0.133s
8         0.005s     0.193s     0.131s
16        0.009s     0.187s     0.125s
32        0.014s     0.17s      0.122s
64        0.027s     0.175s     0.117s
128       0.054s     0.176s     0.116s
256       0.108s     0.191s     0.135s
512       0.219s     0.23s      0.1s
1024      0.403s     0.292s     0.11s
2048      0.749s     0.389s     0.13s
4096      1.454s     0.599s     0.157s
8192      2.872s     1.03s      0.235s
16384     5.714s     1.938s     0.389s
32768     11.297s    4.155s     0.695s
65536     22.803s    7.823s     1.305s
131072    46.006s    15.562s    2.525s
262144    92.34s     30.026s    4.968s
524288    184.077s   61.684s    9.839s
1048576   370.805s   121.218s   19.595s

Original issue reported on code.google.com by oliver.c...@gmail.com on 9 Aug 2012 at 6:07


GoogleCodeExporter commented 9 years ago
Thanks again.  Extra kudos for also providing the test code. 

I need to run off to Excel to chart this.  I would imagine that the knee of the
curve is above 256 but below 4096.  I would argue that Aparapi is really not
suitable for global sizes < 1k-2k...

As you alluded to, this is actually an artifact of the global size and the default
method for choosing group sizes. It is also a problem exposed by choosing a
default group size before knowing which device will be picked.

By default (when executing kernel.execute(int)) Aparapi creates an interim
Range object, but does not *know* where the actual code will be executed
(OpenCL GPU, CPU, JTP or SEQ), so we pick a range which is optimal for GPUs.
This means that we try to get a group size as close to 256 as we can.  For JTP
this actually means we will spawn 256 threads!   For very large global sizes
this actually works out well (especially for regular and predictable
compute loads); for smaller sizes it turns out to be an anti-pattern.
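
In other words, the default behaves roughly like the heuristic sketched below.
This is an illustration of the behaviour described above, not the actual
Range.create code.

// Illustration only (not the real Range.create implementation): the default
// group size is roughly the largest factor of globalSize no bigger than 256.
// In JTP mode Aparapi then spawns one Java thread per work item in that group.
static int defaultGroupSize(int globalSize) {
    int groupSize = 1;
    for (int candidate = 2; candidate <= Math.min(globalSize, 256); candidate++) {
        if (globalSize % candidate == 0) {
            groupSize = candidate;
        }
    }
    return groupSize;
}
// For a global size of 256 this yields 256, i.e. 256 JTP threads for 256 work
// items, which matches the sharp JTP slowdown around that size in the first table.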

Clearly we need a better approach.  

You will notice that more recently I am pushing people towards choosing a 
device, creating a Range for that device and then dispatching using a specific 
Range. 

Device device = ...; // get device
Range range = device.createRange(globalSize);
kernel.execute(range);

For GPU devices (at present) in the main trunk, this ensures that the Range is
'ideal' for the chosen device.  My hope is to use this pattern for JTP as well;
the following code is not complete or even fleshed out:

Device device = Device.JTP(); // no such API
Range range = device.createRange(globalSize);
kernel.execute(range);

This would allow the range to match the # of cores in the case of JTP. 

There will still be the issue of 'fall-back' for when the bytecode cannot be 
converted to OpenCL. In this case JTP is just a safety net and performance may 
well always lag SEQ for small (<4k) global sizes.

I will keep this open and will try to come up with a better 'default' strategy. 

Gary

Original comment by frost.g...@gmail.com on 9 Aug 2012 at 2:13

GoogleCodeExporter commented 9 years ago
Why is one CPU thread spawned for each work item in a work group? It is more
natural to execute one group on one CPU core; this mirrors how groups are
executed on a GPU. If a kernel is optimized for memory locality (i.e. it uses
shared memory or the L1 cache), it should be faster. In any case, setting the group
size to 256 should be [mostly] optimal for any OpenCL device. I hear HotSpot 8
will autovectorize, in which case, again, all work items of a group should run
in one thread.

Original comment by adubin...@almson.net on 15 Feb 2013 at 11:53

GoogleCodeExporter commented 9 years ago
We have to spawn one thread per work item, otherwise a barrier() across the 
group would deadlock.  It is the closest way to emulate OpenCL behaviour.
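
For readers following along, the one-thread-per-work-item approach maps
barrier() onto an ordinary Java barrier; a minimal sketch of the idea
(illustrative only, not the actual KernelRunner code):

import java.util.concurrent.CyclicBarrier;

// Illustration: with one Java thread per work item, a group-wide barrier()
// can be mapped directly onto CyclicBarrier.await().
void runGroup(final int groupSize) throws InterruptedException {
    final CyclicBarrier barrier = new CyclicBarrier(groupSize);
    Thread[] threads = new Thread[groupSize];
    for (int localId = 0; localId < groupSize; localId++) {
        final int id = localId;
        threads[localId] = new Thread(() -> {
            // ... first half of the kernel for work item 'id' ...
            try {
                barrier.await();          // corresponds to Kernel.localBarrier()
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            // ... second half of the kernel for work item 'id' ...
        });
        threads[localId].start();
    }
    for (Thread t : threads) {
        t.join();
    }
}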

I am not a fan of this, if you can envision a better way I would love to try 
it. 

Original comment by frost.g...@gmail.com on 16 Feb 2013 at 12:03

GoogleCodeExporter commented 9 years ago
Well how do you implement SEQ, then?

A work group should be realized as a for() loop going over each work item. 
Every time there is a barrier, you start a new for() loop. Do not use real 
thread synchronization primitives.

Eg:
for(int i = 0; i < groupSize; i++)
{
    // do stuff per thread
    // any call to getGlobalId() returns i
}
// barrier() was executed here
for(int i = 0; i < groupSize; i++)
{
    // continue our kernel
}

The big drawback of this is that it hides concurrency bugs. For debugging, I
suppose, the code can be executed the way it is now (although, in reality,
emulating the concurrent idiosyncrasies of real GPUs is a huge task in
itself... better to imagine some sort of real in-hardware debugger).
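
To make the loop-splitting concrete, here is a small stand-alone sketch
(illustrative only; the computation is a placeholder) of a single work group
whose kernel has one barrier. Per-work-item locals that live across the barrier
are hoisted into arrays indexed by local id.

// Illustration of the loop-splitting idea for a single work group.
// Per-work-item locals that survive the barrier are hoisted into arrays.
static void runGroupSequentially(int groupSize, float[] out) {
    float[] partial = new float[groupSize];          // per-item state across the barrier

    for (int localId = 0; localId < groupSize; localId++) {
        // first half of the kernel; getLocalId() would return localId here
        partial[localId] = localId * 2.0f;
    }
    // barrier(): every work item has finished the first half before any starts the second
    for (int localId = 0; localId < groupSize; localId++) {
        // second half of the kernel, free to read other items' first-half results
        int neighbour = (localId + 1) % groupSize;
        out[localId] = partial[localId] + partial[neighbour];
    }
}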

Original comment by adubin...@almson.net on 16 Feb 2013 at 12:53

GoogleCodeExporter commented 9 years ago
adubinsky,

Thank you for the suggestion. Are you available and/or interested in looking at 
the latest Trunk code and implementing your suggested fix in a branch that we 
can test?

Original comment by ryan.lam...@gmail.com on 22 Apr 2013 at 5:11

GoogleCodeExporter commented 9 years ago
I got a roughly 15x speed-up in one app in JTP mode by modifying KernelRunner 
to use a standard thread pool (java.util.concurrent.Executors/ExecutorService). 
There's a tremendous amount of overhead in creating and destroying threads 
rapidly.

I added one field:
   private final ExecutorService threadPool = Executors.newCachedThreadPool();

I removed threadArray since it wasn't really used, and instead of new 
Thread().start():

    threadPool.submit(new Runnable(){....});

Without changing workgroup size/dimensions, this was a very effective speedup.
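
For anyone reading along, the change amounts to something like the sketch below.
This is not the actual KernelRunner diff; the CountDownLatch here merely stands
in for the existing joinBarrier mechanism discussed further down.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the reported change: reuse pooled threads instead of constructing
// a new Thread per work item on every kernel launch.
class PooledDispatcher {
    private final ExecutorService threadPool = Executors.newCachedThreadPool();

    void dispatch(Runnable[] workItems) throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(workItems.length);
        for (final Runnable item : workItems) {
            threadPool.submit(() -> {
                try {
                    item.run();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();   // wait for the whole launch, like await(joinBarrier)
    }

    void dispose() {
        threadPool.shutdownNow();   // see the follow-up comment below
    }
}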

Original comment by paul.mi...@gmail.com on 12 Jun 2013 at 1:42

GoogleCodeExporter commented 9 years ago
(Forgot to add: you also need to do a threadPool.shutdownNow() within the dispose()
method.)

Original comment by paul.mi...@gmail.com on 13 Jun 2013 at 7:22

GoogleCodeExporter commented 9 years ago
paul:

A threadpool can't be used. Your threads will deadlock if they try to 
synchronize.

ryan:

I'm not able to help. But I can say the strategy is to use continuations.
There are some libraries available
(http://stackoverflow.com/questions/2846428/available-coroutine-libraries-in-java)
but they seem pretty old and unmaintained.

Anyway, it proves it is possible in Java. A custom implementation could perhaps
be simpler and faster (e.g., exploit the fact that there is no recursion in OpenCL
and that the required stack size can be pre-computed).

Original comment by adubin...@almson.net on 21 Jun 2013 at 5:47

GoogleCodeExporter commented 9 years ago
Why would a threadpool cause a deadlock? The only difference is that the 
threadpool will re-use threads. A thread is not "tainted" from running a 
kernel, and so should be re-usable.

Original comment by paul.mi...@gmail.com on 21 Jun 2013 at 8:30

GoogleCodeExporter commented 9 years ago
I think the concern is that, unlike most multithreaded applications, Aparapi apps
must map to the 'work group' model used by OpenCL.

This is required so that Kernel.localBarrier() is honored. 

My take is that provided the pool of threads is equal to the width of a group 
(which I think is what we have) then we are safe.

If the pool were smaller than a group, we would indeed deadlock if a kernel
contained:

Kernel k = new Kernel(){
   public void run(){
       // do something
       localBarrier();
       // do something else
   }
};
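
To illustrate the corner case being discussed: with a fixed pool narrower than
the group, a group-wide barrier deadlocks, whereas a cached pool (which always
starts a new thread when none is idle) does not. A small stand-alone sketch,
not taken from the Aparapi sources:

import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Deadlock illustration: a group of 4 work items sharing one barrier, but a
// fixed pool of only 2 threads. The first 2 tasks block in await() and never
// free their threads, so the last 2 tasks never start and the barrier never trips.
public class SmallPoolDeadlock {
    public static void main(String[] args) {
        final int groupSize = 4;
        final CyclicBarrier localBarrier = new CyclicBarrier(groupSize);
        ExecutorService pool = Executors.newFixedThreadPool(2);   // narrower than the group

        for (int i = 0; i < groupSize; i++) {
            pool.submit(() -> {
                try {
                    localBarrier.await();   // analogous to Kernel.localBarrier()
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        // The JVM never exits: both pool threads stay blocked in await().
    }
}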

'adubinsky' (apologies, I do not know your name): is it your understanding that
by accepting this patch we may now deadlock? If so, can you elaborate? I
still think we are good.

BTW, continuations would be very cool indeed.  I have seen some work attempting
to do this in Java; I must admit it is something I am glad I did not take on ;)

Gary

Original comment by frost.g...@gmail.com on 21 Jun 2013 at 9:15

GoogleCodeExporter commented 9 years ago
The threadpool uses the same safety mechanism already in place for the new 
Thread() approach, the join barrier in KernelRunner.java: await(joinBarrier);

Without the barrier, there would be concurrency problems either way, whether 
the threads are newly constructed or re-used.

Original comment by paul.mi...@gmail.com on 21 Jun 2013 at 9:21

GoogleCodeExporter commented 9 years ago
Paul, I agree. I just want to make sure that we are not missing something, and
want to give 'adubinsky' a chance to elaborate.  There may be a corner case we
have missed.

Gary 

Original comment by frost.g...@gmail.com on 21 Jun 2013 at 9:57

GoogleCodeExporter commented 9 years ago
Sorry, newCachedThreadPool() should indeed work. I double-checked the docs, and
it guarantees that all submitted tasks are run immediately.

I mixed it up with the more common newFixedThreadPool, figuring you were trying
to reduce the total number of concurrent threads. newCachedThreadPool solves the
issue of kernel launch overhead for short-running kernels, but shouldn't speed up
long-running kernels. Using continuations should help in the latter case by
getting rid of the OS overhead.
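
For clarity, the distinction being drawn here is standard java.util.concurrent
behaviour; a short illustration:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolKinds {
    public static void main(String[] args) {
        // Caps concurrency: tasks beyond the 4 running ones queue up behind them,
        // which is what would break a group-wide barrier.
        ExecutorService fixed = Executors.newFixedThreadPool(4);

        // Creates a thread whenever no idle one is available, so every submitted
        // task begins running immediately; idle threads are reused across kernel
        // launches, which is where the launch-overhead win comes from.
        ExecutorService cached = Executors.newCachedThreadPool();

        fixed.shutdown();
        cached.shutdown();
    }
}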

Original comment by adubin...@almson.net on 22 Jun 2013 at 6:29