Oh sorry, Kernel currently does not inherit from KernelRunner but only delegates the call. Sorry for the inconsistency ...
Original comment by matthias.klass@gmail.com
on 11 Jul 2013 at 9:49
Ok, here is a proof of concept. Multiple entrypoints are not working yet, but I changed the API to support passing kernel instances. Mandelbrot compiles and works using GPU execution.
Link: https://github.com/klassm/aparapi-clone
(just temporarily on GitHub ...)
Matthias
Original comment by matthias.klass@gmail.com
on 11 Jul 2013 at 12:58
A simple question: when executing a new entry point, it has to be prepared for execution by generating code and by setting the proper JNI args. Is it OK to reinitialize the JNI args with all of the kernel entrypoint arguments?
Kernel A: fieldA, fieldB
Kernel B: fieldC, fieldD
The union of all four fields would be used afterwards. This has the disadvantage that on every update call all fields have to be updated, which might result in a slowdown. A workaround, however, would be to store the kernel args per entrypoint in a map and only call update on the required arguments, as sketched below.
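A rough sketch of that map (all class and method names here are invented for illustration; the real KernelArg lives inside Aparapi):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerEntrypointArgs {
    // Stand-in for Aparapi's internal KernelArg.
    static class KernelArg {
        final String fieldName;
        KernelArg(String fieldName) { this.fieldName = fieldName; }
        void refresh() { /* re-read the backing field, update the JNI arg */ }
    }

    private final Map<String, List<KernelArg>> argsByEntrypoint =
        new HashMap<String, List<KernelArg>>();

    void register(String entrypoint, KernelArg arg) {
        List<KernelArg> args = argsByEntrypoint.get(entrypoint);
        if (args == null) {
            args = new ArrayList<KernelArg>();
            argsByEntrypoint.put(entrypoint, args);
        }
        args.add(arg);
    }

    // Only the arguments of the entrypoint being run are refreshed, so
    // running kernel A (fieldA, fieldB) never forces an update of
    // kernel B's fieldC and fieldD.
    void updateArgsFor(String entrypoint) {
        List<KernelArg> args = argsByEntrypoint.get(entrypoint);
        if (args == null) return;
        for (KernelArg arg : args) {
            arg.refresh();
        }
    }
}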
What do you think?
Original comment by matthias.klass@gmail.com
on 11 Jul 2013 at 2:03
Matthias, thanks for diving in here.
When you look through the code you will see some attempts at multiple entrypoints. Actually, you may see multiple attempts to solve this problem in the code as you experiment. Each time I got stuck on different aspects.
Initially it was how to dispatch. With a single abstract method type (Kernel.run()) as the only entrypoint, it is easy to map Kernel.execute() to dispatch the Kernel.run() method. When there are multiple possible entrypoints, I think we need to rely on String name mapping and/or reflection. The new MethodHandle API (Java 7) does offer a cleaner mapping.
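For illustration, a minimal, self-contained example of name-based dispatch through a MethodHandle (the Stencil class and its entrypoints are made up):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MethodHandleDispatch {
    // Hypothetical kernel exposing two entrypoints.
    public static class Stencil {
        public void horizontal() { System.out.println("horizontal pass"); }
        public void vertical()   { System.out.println("vertical pass"); }
    }

    public static void main(String[] args) throws Throwable {
        Stencil kernel = new Stencil();
        // Resolve the entrypoint by name once, then dispatch through the
        // handle instead of repeated Method.invoke() reflection.
        MethodHandle entry = MethodHandles.lookup().findVirtual(
            Stencil.class, "horizontal", MethodType.methodType(void.class));
        entry.invoke(kernel);
    }
}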
BTW I think the bound interface approach that we used for accessing pre-constructed OpenCL may offer a possible solution here. So instead of just extending Kernel, a user creates an interface which exposes the 'entrypoints' and implements that interface as part of their Kernel definition. Then we can use Java's 'proxy' mechanism to construct a real object which delegates to the KernelRunner (I think I need to draw a diagram for this ;) ). Java 8's lambdas solve this using method handles and synthetically generated inner classes on the fly.
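A minimal sketch of that proxy idea (the Convolution interface is hypothetical, and the handler only prints where a real KernelRunner would generate and enqueue the OpenCL kernel):

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class ProxyDispatch {
    // User-facing interface exposing the entrypoints.
    public interface Convolution {
        void blur();
        void sharpen();
    }

    public static void main(String[] args) {
        // The handler plays the KernelRunner role: every interface call
        // arrives here carrying the entrypoint name.
        InvocationHandler toRunner = new InvocationHandler() {
            public Object invoke(Object proxy, Method method, Object[] a) {
                System.out.println("enqueue entrypoint: " + method.getName());
                return null;
            }
        };
        Convolution c = (Convolution) Proxy.newProxyInstance(
            Convolution.class.getClassLoader(),
            new Class<?>[] { Convolution.class },
            toRunner);
        c.blur();    // delegated by name - no single Kernel.run() needed
        c.sharpen();
    }
}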
The other issue I encountered was dealing with arrays/buffers which are accessed inconsistently (RW vs RO vs WO) depending on the entrypoint. Because we have no idea what order of dispatch might take place, we may need to fall back to the 'minimal' restriction. So if entrypoint E1 accessed A as RW and entrypoint E2 accessed A as RO, we define the buffer as RW and always pass it back and forth between calls.
The latter can be simplified a little by forcing explicit buffer transfers when
using multiple entrypoints.
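As a sketch, the widening rule could look like this (the enum and method names are mine, not Aparapi API):

public class AccessMerge {
    enum Access { RO, WO, RW }

    // Each buffer gets the least restrictive mode required by any
    // entrypoint that touches it; any mix of modes widens to RW.
    static Access merge(Access a, Access b) {
        return (a == b) ? a : Access.RW;
    }

    public static void main(String[] args) {
        // E1 accesses A as RW, E2 accesses A as RO => A must be RW.
        System.out.println(merge(Access.RW, Access.RO)); // prints RW
    }
}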
I do think we need one 'context' (OpenCL command queue) shared between all possible entrypoints. The KernelRunner can act in this role, I think. I am not sure another level of abstraction is needed.
I am very interested in this work; like I said, I have approached this multiple times already and got overwhelmed ;) I really do welcome someone with a fresh pair of eyes and a different perspective taking a crack at this.
If you would like to bounce ideas around, I would be more than happy to do
this.
Original comment by frost.g...@gmail.com
on 11 Jul 2013 at 2:21
Hi,
sounds good :-). The current state is that I really split up Kernels from the KernelRunner, which now, as you described, acts as a single holder for the JNI context. I can even start multiple kernel objects, as long as they contain the same arguments. What is missing is the management of different kernel fields. I'll find a solution for this ;-)
Matthias
Original comment by matthias.klass@gmail.com
on 11 Jul 2013 at 2:28
Hi Matthias,
This is an excellent issue request... and is a dupe of
http://code.google.com/p/aparapi/issues/detail?id=21.
But no worries - what you are describing in this issue speaks to some discussions we've had in the past. See the following (specifically, comment #3 on issue 105):
http://code.google.com/p/aparapi/issues/detail?id=105#c3
http://code.google.com/p/aparapi/issues/detail?id=104
I think that decoupling a number of the classes and changing the way OpenCL is executed will work towards a number of goals. One thing that would be very nice to see, for example, would be for a Kernel to accept another Kernel as an argument, allowing us to chain calls.
I'll keep track of this discussion. In the next few weeks, I will have some
more time available and plan to take another look at a couple of things in
Aparapi. I also plan to submit a fairly rigorous Correlation Matrix test case
that I need some eyeballs to look at for performance modifications. Or
potentially use it as a test case for issue tickets like this one :)
Original comment by pnnl.edg...@gmail.com
on 11 Jul 2013 at 10:19
I'll try to keep you up to date, which is why I am posting my progress today:
* I'll commit any changes to https://github.com/klassm/aparapi-clone, where you will have a chance to watch my progress. If you want, we can later merge it back to an svn branch.
* The KernelRunner now holds a map of multiple JNIContext values. Whenever a kernel run is scheduled, the right one is pulled from the map and executed. (BTW: I cleaned up the KernelRunner class a little bit, splitting the execute method into some more readable methods.)
* Internally, on the JNI side, I plan to have the JNIContext objects map to the same OpenCL context. This is why the OpenCL init will move to some global object where all JNIContexts can access it. This is still missing for now.
* Another thing which is missing, and which I am currently pondering, is how to make sure that KernelArgs can refer to the same Java object references. I thought about mapping KernelArgs to GPUArgs (which represent Java objects currently on the GPU). That way, KernelArgs from multiple entrypoints could refer to the same GPU memory locations. This is a bit tricky and, for now, I am not really sure how I want to implement it.
When both issues above are done, I think it should be possible to execute multiple kernels (or am I missing something?). Let's see!
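A minimal Java sketch of the per-kernel context lookup described above (the map layout and the native call are stand-ins, not the real code):

import java.util.IdentityHashMap;
import java.util.Map;

public class ContextRegistry {
    // One native JNIContext handle per kernel instance, all owned by a
    // single KernelRunner on the Java side.
    private final Map<Object, Long> jniContextByKernel =
        new IdentityHashMap<Object, Long>();

    long contextFor(Object kernel) {
        Long handle = jniContextByKernel.get(kernel);
        if (handle == null) {
            handle = initNativeContext(kernel); // would be the JNI init call
            jniContextByKernel.put(kernel, handle);
        }
        return handle;
    }

    private long initNativeContext(Object kernel) {
        return System.identityHashCode(kernel); // stand-in for the native handle
    }
}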
Original comment by matthias.klass@gmail.com
on 15 Jul 2013 at 2:42
Very nice, I look forward to tracking your progress.
Original comment by pnnl.edg...@gmail.com
on 16 Jul 2013 at 1:31
Ok, finally it works. There might still be some bugs, but essentially it is possible to execute multiple kernel entry points on the same JNI context.
Example:
https://github.com/klassm/aparapi-clone/blob/master/test/runtime/src/java/com/amd/aparapi/test/runtime/MultipleKernelCall.java
This was a bigger change, as I wanted to map the buffers of the KernelArgs to each other. That way multiple kernel args can point to the same GPU memory, which is pretty neat I guess. Therefore I implemented a BufferManager, which is responsible for managing all the buffers referring to Java objects, as well as cleaning up afterwards so as not to leave any memory allocated on the GPU after execution. A sketch of the idea follows.
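In Java terms the sharing idea looks roughly like this (the real BufferManager lives on the JNI side; all names here are illustrative):

import java.util.IdentityHashMap;
import java.util.Map;

public class BufferManagerSketch {
    // One device buffer per Java array, keyed by identity so two kernel
    // args referencing the same array share the same allocation.
    private final Map<Object, Long> deviceBufferByArray =
        new IdentityHashMap<Object, Long>();

    long bufferFor(Object javaArray) {
        Long clMem = deviceBufferByArray.get(javaArray);
        if (clMem == null) {
            clMem = allocateOnDevice(javaArray); // would be clCreateBuffer
            deviceBufferByArray.put(javaArray, clMem);
        }
        return clMem; // shared by every KernelArg using this array
    }

    private long allocateOnDevice(Object javaArray) {
        return System.identityHashCode(javaArray); // stand-in for the native call
    }
}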
In addition, the OpenCL context moved to a global attribute to make it accessible from multiple JNIContexts. This has the consequence that the execution device can only be set once (for the first kernel call). I'll have to adjust the API accordingly.
The main test cases work with my implementation. However, some of them do not. I checked whether this is because of me or whether this also happens in trunk. Result: the tests also fail there. The affected test cases are:
- Game of Life (only a black screen?)
- Issue 102 (OOP conversion)
- Issue 103
- UseStaticArray
So I'll go change the API :-)
Original comment by matthias.klass@gmail.com
on 17 Jul 2013 at 10:01
I just tested how my implementation behaves in terms of speed. Using my fluid simulation, I cannot recognize any differences. The next step for me is to port my fluid solver implementation to the new multiple entry points implementation. As this is a pretty big use case, I hope to uncover any hidden errors. I'll report back on how it goes.
Original comment by matthias.klass@gmail.com
on 17 Jul 2013 at 2:54
Regarding Game of Life not working:
It works for me. If you have a smaller screen size, the [start] button may be hidden (off the bottom of the screen), so it may appear to just be a blank screen.
If you are using Linux, clicking on the frame should make the start button visible.
Can you check this?
BTW your exposing of the KernelRunner is interesting. Before Aparapi was called Aparapi, it was a much smaller project called 'Barista', which allowed multiple kernels to be executed by a single KernelRunner (which indeed held the context and queue and managed the buffers).
So in Barista we would do something like:
KernelRunner kr = new KernelRunner();
Kernel k1 = new Kernel(){
   @Override public void run(){ /* first kernel body */ }
};
Kernel k2 = new Kernel(){
   @Override public void run(){ /* second kernel body */ }
};
kr.run(k1);
kr.run(k2);
;)
Early comments indicated that for simple examples exposing the KernelRunner was
too verbose.
Now that we have evolved a little, I think we should have kept the KernelRunner as a standalone class. It would also help with explicit buffer management.
Thanks for putting this together. I will download your repository soon and
give this a whirl.
This is very interesting work.
Gary
Original comment by frost.g...@gmail.com
on 17 Jul 2013 at 3:06
Hi Gary,
you were right. The start button was indeed hidden - I should have seen that ...
Since you started this - what was the origin of Aparapi? I always thought that the framework had its roots within AMD. Concerning Barista, I did not find anything on Google - an internal framework? This would be interesting background for the final presentation of my master's thesis - some information on the framework I chose :-).
I also thought about making the KernelRunner a singleton and letting the user continue to call execute directly on the Kernel. However, if the user calls dispose on the KernelRunner, the whole instance will be disposed. We would need a much more elaborate lifecycle then. This would be a nice thing, I guess - there's always something to do :-)
Matthias
Original comment by matthias.klass@gmail.com
on 17 Jul 2013 at 3:14
Barista was the internal name at AMD. Two weeks before our first public
'reveal' (JavaOne 2010) there was an internal AMD request to change the name (I
think there was an established open source project with this name).
The name Aparapi was the result of this last minute scramble :)
Barista/Aparapi was started after I was asked to write a Java based app which
used OpenCL for 'SuperComputing' 2009. Whilst I was happy to learn OpenCL to
write the required OpenCL code (I think I used JOCL as the binding - and it
worked well!) I came to the conclusion that most Java developers would prefer
not to do this.
I was a big fan of Java tools JAD and Mocha (which both create perfectly
serviceable Java source from bytecode) so I decided to see how hard it might be
to parse bytecode and turn it into OpenCL. The basic (very crude - enough to run the NBody example) bytecode-to-OpenCL engine took around 3 weeks over Christmas of 2009.
The hardest part (and the part we are still struggling with) is how much to
expose to the Java developer....
Gary
Original comment by frost.g...@gmail.com
on 17 Jul 2013 at 5:50
Thanks for the background info! So is the development still supported by AMD, or has it shifted in the meantime towards open source / leisure time :-)?
To be honest, I also like this native kind of GPU binding. For my evaluation, I also looked at JCuda, which is the equivalent of JOCL for CUDA. It is by far the fastest framework I could find for programming GPUs from Java - in places it is 40-100 times faster than other frameworks. And it is not even that bad to program...
Matthias
Original comment by matthias.klass@gmail.com
on 18 Jul 2013 at 9:02
OK, finally the execution completely works. You might want to have a look at the implementation. As a final test, I ported the fluid simulation to the new API and backend. It uses explicit buffer handling and 14 kernels.
To give you an impression of how much multiple entrypoints change the way the framework is used, I created two class diagrams:
- solver using the old API:
http://www.hs-augsburg.de/~klassm/simulator_structure.png
- solver using the new API:
http://www.hs-augsburg.de/~klassm/simulator_structure_new.png
The new image does not contain all the information, as it would have been too cluttered. Instead, I added only the kernels themselves. The surroundings stayed the same.
By using the multiple entrypoints I could finally split the one monolithic kernel into multiple Java objects. The individual kernel arguments are mapped by a BufferHandler class in C++ to ArrayBuffers and OpenCL memory.
... and finally, a small video of what the simulator looks like:
http://www.hs-augsburg.de/~klassm/simulator.mkv
Matthias
Original comment by matthias.klass@gmail.com
on 23 Jul 2013 at 1:55
Matthias,
Nice work, and thanks for the video (I have been sending links to folk on our
team ;))
I plan to take a deeper look at this, when I get some time.
Gary
Original comment by frost.g...@gmail.com
on 23 Jul 2013 at 2:29
Yep, sure. By the way, as for execution time: Aparapi is about 5 times slower than a native implementation (measured with my fluid solver example). This should be pretty representative, as loads of kernels are executed. Aparapi also takes about 1.5 times the execution time of JCuda (which is only a JNI wrapper). This info is taken from various benchmarks incl. Mandelbrot and matrix multiplication. Just as info ...
Original comment by matthias.klass@gmail.com
on 23 Jul 2013 at 2:48
Thanks for the #'s.
Do you also have a sense of the performance relative to a pure Java solution?
Gary
Original comment by frost.g...@gmail.com
on 23 Jul 2013 at 3:21
Hi,
sure, but currently only for matrix multiplication. The other tasks (Mandelbrot, conjugate gradient, Jacobi iterations) are still running and taking forever ...
Aparapi is currently about 20x faster than a serial implementation.
Matthias
Original comment by matthias.klass@gmail.com
on 23 Jul 2013 at 3:24
Hi,
multiple entrypoints have now resulted in a fairly big refactoring. Before the restructuring, I used a global command queue. That does not really work, as multiple KernelRunners would interfere with each other.
To sidestep this behaviour, I changed some things:
* JNIContext => KernelContext
(represents the context for a single kernel being executed natively)
* KernelRunnerContext (new)
(represents the context for a single kernelRunner, now includes a command queue and all the openCL dependent attributes)
In order to make this work, most of the JNI methods got an additional parameter referencing the KernelRunnerContext address (the same hook as previously used for the JNIContext).
Because I was already refactoring, I additionally removed the aparapiBuffer and arrayBuffer attributes from KernelArgs and introduced a single buffer object of type GPUElement*. ArrayBuffer and AparapiBuffer now derive from GPUElement. Using this polymorphism, I could delete a whole bunch of code.
Finally, I changed the behaviour of the newly implemented BufferManager. Its responsibility is to look after all instantiated buffers and make sure that all buffers without any reference are freed. Previously, I looped over all JNIContexts (alias KernelContexts) and then over all KernelArgs, trying to figure out which elements in my global buffer queue were not referenced. Now I just keep an integer within each buffer indicating how many times it is referenced. If the reference count is 0, I can free it. This skips 4 expensive loops and speeds up execution. The sketch below shows the idea.
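Rendered as a small Java sketch (the real bookkeeping is in C++; names are mine), the scheme amounts to:

public class RefCountedBuffer {
    long clMemHandle;     // underlying device allocation
    private int refCount; // how many KernelArgs currently point here

    void retain() { refCount++; }

    void release() {
        if (--refCount == 0) {
            // Free the device memory right away - no scan over all
            // KernelContexts and their KernelArgs is needed any more.
            freeOnDevice(clMemHandle);
        }
    }

    private void freeOnDevice(long handle) { /* native clReleaseMemObject */ }
}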
Just to keep you up to date ...
Matthias
P.S.: I'll create a class diagram to show this more clearly.
Original comment by matthias.klass@gmail.com
on 1 Aug 2013 at 2:58
"Finally, I changed the behaviour of the newly implemented BufferManager. Its
responsibility is to look after all instantiated buffers and make sure that all
buffers without any reference are freed. Previously, I looped over all
JNIContexts (alias KernelContexts), afterwards over all KernelArgs and tried to
figure out which elements in my global buffer queue were not referenced. Now I
just keep an integer within the buffer indicating how many times they are
referenced. If the reference count is 0, I can free them. This skips 4
expensive loops and speeds up execution."
Just a thought - I haven't had a chance to look at this code, but based on what you wrote in quotes, is this a candidate for a WeakHashMap instead of manual reference counting?
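On the Java side the suggestion would look roughly like this (sketch only; a WeakHashMap by itself would not free the native cl_mem - that still needs an explicit release, e.g. driven by a ReferenceQueue):

import java.util.Map;
import java.util.WeakHashMap;

public class WeakBufferIndex {
    // Keyed by the Java array a device buffer mirrors; the entry
    // disappears on its own once the array becomes unreachable, so no
    // manual reference counting is needed for the lookup side.
    private final Map<float[], Long> deviceBufferByArray =
        new WeakHashMap<float[], Long>();

    void remember(float[] array, long clMemHandle) {
        deviceBufferByArray.put(array, clMemHandle);
    }

    Long lookup(float[] array) {
        return deviceBufferByArray.get(array);
    }
}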
Original comment by pnnl.edg...@gmail.com
on 1 Aug 2013 at 7:19
Ok, finally the two class diagrams, one for the native and one for the Java side.
@pnnl: Yes, something of that kind would be really nice. However, WeakHashMap is Java - is there an equivalent for C++? I do not really want to add a library dependency just to replace one reference counter.
Matthias
Original comment by matthias.klass@gmail.com
on 2 Aug 2013 at 1:57
Attachments:
Another two graphics which might be interesting for you:
- The first one compares a native fluid simulator against one implemented in Aparapi. The x-axis represents the cell count of the simulation, the y-axis the execution time in ms. The green curve is default Aparapi, the blue one my adapted version with multiple entry points and the Device#best fix (which is why the blue one is better; otherwise the times would have been the same).
- The second one compares against other frameworks. Maybe you know your competitors? The graph contains Delite (Stanford University), Rootbeer (Syracuse University), JCuda and native CUDA. The execution time of Rootbeer is really high, which is due to multiple unnecessary kernel invocations. Delite has a pretty huge overhead for CUDA execution. JCuda's execution time is almost exactly that of native CUDA (the difference is just the JNI overhead). Aparapi is usually a little slower than native CUDA. In some exceptional cases Aparapi is even a little faster (for example on Mandelbrot). This might be due to concurrent copying...
Matthias
Original comment by matthias.klass@gmail.com
on 7 Aug 2013 at 3:18
Attachments:
Thank you for posting this.
Clearly Marco Hutter deserves some kudos for his JCuda work (and for JOCL, which is very well structured). I need to send this to Marco... he has always been very supportive of Aparapi's goals.
+ thanks for motivating me to look even closer at the Device.best() fix ;)
Gary
Original comment by frost.g...@gmail.com
on 7 Aug 2013 at 6:51
@matthias I've been really hoping for an update like this. Looking at how .cl code is programmed, this is pretty much how it is done. It has the benefit of treating buffers as RO, RW, etc. depending on the function, as well as making it so much easier to program.
I've built your branch and am going to give it a go now, but if this could be incorporated into the main project that would be fantastic.
A lot of the classes I use that require multiple entrypoints get extended, and using a mode buffer and if/else gets messy with inheritance.
Original comment by technoth...@gmail.com
on 28 Sep 2013 at 4:46
Also, just a thought: could this also solve some problems for 2D arrays (at least when processing a single sub-array at a time)?
For example, for a simple feed-forward neural network.
Original comment by technoth...@gmail.com
on 28 Sep 2013 at 5:07
Attachments:
Hi,
nice that it works for you! Yep, I would also appreciate an integration into main. However, I do not think that I am the best one to do that work - it should be done by AMD Aparapi developers like Gary, so that they can support the code base later on :-).
Matthias
Original comment by matthias.klass@gmail.com
on 8 Oct 2013 at 1:13
Any update on getting this work into mainline AMD Aparapi? I would also like to use multiple kernels on shared data, and the proposal here seems a clean and simple way of doing it.
Original comment by paulsou...@gmail.com
on 19 Oct 2014 at 5:32
Original issue reported on code.google.com by
matthias.klass@gmail.com
on 11 Jul 2013 at 9:36
Attachments: