bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

Merging with JCuda and JOpenCL projects for better quality cuda interfaces #475

Open archenroot opened 6 years ago

archenroot commented 6 years ago

@saudet Hi buddy, it came to my mind in the last few weeks: what about merging the CUDA and OpenCL stuff here with the work of the guys from the JCuda and JOCL projects? I understand there are some fundamental differences, but having more quality devs on a single project could improve the project's quality as well.

The guys from JCuda opened discussion on my request here: https://forum.byte-welt.net/t/about-jcuda-and-javacpp/19538

So, if you think it could bring more value as well, you are free to join the discussion.

saudet commented 6 years ago

That would be nice, but the problem is that people expect Oracle to come up with a better solution than JavaCPP, even though they are not working on anything at the moment. As far as I can tell, the developers of Project Panama have given up on any generic solution for C++; no one knows how to make something better than JavaCPP. Still, they hope, believe, and wait, mostly. If you could help convince them that nothing better is going to happen, explaining and re-explaining over and over how JavaCPP could get better, that would be the first thing that needs to be done.

archenroot commented 6 years ago

Just for reference: https://github.com/jcuda/jcuda/issues/12#issuecomment-335010118

archenroot commented 6 years ago

@saudet I am reading about it - it actually started a long time ago (the Panama project). What is that project based on, JNI or something new?

Anyway, I registered for the project mailing list, but to be honest, when I went through some links on the project site, some repository links are broken, and the blogs of the main creators/devs have not been updated for a long time... I will check more and read about this.

jcuda commented 6 years ago

There is a lot happening in Panama (the project...) right now. Admittedly, although I'm registered to the mailing list, it is too much to follow in detail. However, if they manage to achieve the goals stated at the project site, http://openjdk.java.net/projects/panama/ , this would certainly compete with JavaCPP.

Of course, development there happens at a different pace. We all know that a "single-developer project" often can be far more agile than a company-driven project, where specifications and sustainability play a completely different role. Panama also approaches topics that go far beyond what can be accomplished by JavaCPP or JNI in general. They are really going down to the guts, and the work there is interwoven with the topics of Value Types, Vectorization and other HotSpot internals.

So I agree with saudet that it does not make sense to (inactively) "wait for a better solution". JavaCPP is an existing solution for (many of, but by no means all of) the goals that are addressed in Panama.


More generally speaking, the problem of fragmentation (in terms of different JNI bindings for the same library) has occurred quite frequently. One of the first "large" cases was OpenGL, where JOGL basically competed with LWJGL. For CUDA, there were some very basic approaches, but none of them (except for JCuda) have really been maintained. When OpenCL popped up, a handful of Java bindings quickly appeared (some of them listed at jocl.org and in this Stack Overflow answer), but I'm not sure how actively each of them is still used and maintained.

(OT: It has been a bit quiet around OpenCL in general recently. Maybe due to Vulkan, which also supports GPU computations? When Vulkan was published, I registered jvulkan.org, but the statement "Coming soon" there is no longer true: There already is a Vulkan binding in LWJGL, and the API is too complex to create bindings for manually. There doesn't seem to be a Vulkan preset for JavaCPP, or did I overlook it?)

For me, as the maintainer of jcuda.org and jocl.org, one of the main questions about "merging" projects would be how this can be done "smoothly", without just abandoning one project in favor of the other. I have always tried to be backward compatible and "reliable", in that sense. Quite a while ago, I talked to one of the maintainers of Jogamp-JOCL about merging the Jogamp-JOCL and the jocl.org-JOCL. One basic idea there was to reshape one of the libraries so that it could be some sort of "layer" placed over the other, but this idea has not been pursued any further.

I'm curious to hear other thoughts and ideas about how such a "merge" might actually be accomplished, considering that the projects are built on very different infrastructures.

saudet commented 6 years ago

I am also registered to the list, but I'm not seeing anything happen. Could you point me to where, for example, they demonstrate creating an instance of a class template? I would very much like to see it. Thanks

saudet commented 6 years ago

Yes, JCuda, etc could be rebased on JavaCPP, that's the idea IMO. There are no bindings for OpenCL or Vulkan just because I don't have the time to do everything, that's all.

archenroot commented 6 years ago

@jcuda @saudet A little offtopic, but related: I am very interested in JNR, but to be honest, I wasn't able to find any kind of benchmark or even a detailed comparison. Before, we had JNA and JNI: JNA was slow but easy to use, while for high-performance stuff you need performance, so you go with JNI where possible, right? That is also the way of JavaCPP and JCuda. Could you guys post some reference document comparing JNR to JNI from a performance perspective? I would love to understand the internal architecture of JNR, especially its performance benefits over JNI. I am aware it goes far beyond performance only, but when you run a 200-node CPU/GPU cluster, performance (throughput and latency) matters. The complexity of adoption can always be handled :-)

saudet commented 6 years ago

I know about these links for JNR: http://www.oracle.com/technetwork/java/jvmls2013nutter-2013526.pdf https://github.com/bytedeco/javacpp/issues/70

archenroot commented 6 years ago

@saudet thanks buddy,

I also suggest moving the discussion about JCuda vs JavaCPP to Marco's thread, as he requested: https://forum.byte-welt.net/t/about-jcuda-and-javacpp/19538/3

NOTE: Beyond the theoretical discussion, since performance is the top priority, I suggest that you @saudet create a new GitHub project under JavaCPP where we can develop a real benchmark for JCuda and JavaCPP-based CUDA (as Vulkan and OpenCL are not available at the moment), so we can analyze code syntax differences/similarities as well as performance in a unified way.

I also suggest deciding which benchmark framework should be used to build this:

Or here: https://stackoverflow.com/questions/7146207/what-is-the-best-macro-benchmarking-tool-framework-to-measure-a-single-threade

saudet commented 6 years ago

Sure, but who will take the time to do it? I keep telling everyone I don't have the time to do everything by myself...

archenroot commented 6 years ago

I will create the initial project and adapt a few basic CUDA algorithms to be implemented in JCuda and JavaCPP; I hope we can find more users from the other side (JCuda) to participate as well.

saudet commented 6 years ago

Ok, cool, thanks! Can we name the repo "benchmarks"? or would there be a better name?

archenroot commented 6 years ago

I think making it generic is best, so "benchmarks" sounds good. Out of this, I would also like to later (if I have time) test JavaCPP vs JNR on some simple dummy function-call tests from libc, using the uuid functions as a kind of template:

#include <uuid/uuid.h>

void uuid_generate(uuid_t out);
void uuid_generate_random(uuid_t out);
void uuid_generate_time(uuid_t out);
int uuid_generate_time_safe(uuid_t out);
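A pure-Java baseline for such a harness could look like the following sketch (hypothetical: it times `java.util.UUID` generation; a JavaCPP- or JNR-wrapped `uuid_generate` would slot into the same `nsPerCall` loop):

```java
import java.util.UUID;

public class UuidBench {
    // Times n calls and returns average nanoseconds per call.
    static double nsPerCall(int n) {
        for (int i = 0; i < n; i++) UUID.randomUUID();   // warmup for the JIT
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) UUID.randomUUID();
        return (System.nanoTime() - t0) / (double) n;
    }

    public static void main(String[] args) {
        System.out.printf("java.util.UUID: %.1f ns/call%n", nsPerCall(100_000));
    }
}
```

The same loop body would then be swapped for the native binding under test, keeping the measurement identical across libraries.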
jcuda commented 6 years ago

I am also registered to the list, but I'm not seeing anything happen. Could you point me to where, for example, they demonstrate creating an instance of a class template? I would very much like to see it. Thanks

Again, I'm not so deeply involved there, but their primary goal is (to my understanding) not something that is based on accessing libraries via their definitions in header files. My comment mainly referred to the high-level project goals (i.e. accessing native libraries, basically regardless of which language they have been written in), together with the low-level efforts in the JVM. At least, there are some interesting threads in the mailing list, and the repo at http://hg.openjdk.java.net/panama/panama/jdk/shortlog/d83170db025b seems rather active.


Regarding the benchmarks: As I also mentioned in the forum, creating a sensible benchmark may be difficult. Even more so if it is supposed to cover the point that is becoming increasingly important, namely multithreading. But setting up a basic skeleton with basic sample code could certainly help to figure out what can be measured, and how it can be measured sensibly.

(As for the topic of merging libraries, the API differences might actually be more important, but this repo would automatically serve this purpose, to some extent - namely, by showing how the same task is accomplished with the different libraries)

archenroot commented 6 years ago

@jcuda

Thanks for your comments. Actually, based on the presentation, it even looks like they have added more processing layers than JNI has :-))), but I will need to investigate the whole story more. Thanks for the link.

Regarding the benchmark: that is the point, establishing a kind of skeleton. By multithreading, do you mean CPU multithreading? I think it will be good, along with the template definition, to discuss possible algorithms to be implemented and their general specification. Good point.

That is exactly the point, because I also do not know how big the differences are at the moment, i.e. how big a breakthrough we are talking about.

saudet commented 6 years ago

@archenroot I created the repository and gave you admin access: https://github.com/bytedeco/benchmarks Feel free to arrange it as you see fit and let me know if you need anything else! Thanks

archenroot commented 6 years ago

@saudet Good starting point, I will try to do as discussed: prepare a common benchmark structure/template and a list of interesting algorithms (including, of course, multi-threaded ones from the client perspective).

I am also thinking of providing existing C/C++ implementations in some cases, where available, to compare native performance, but I will focus on JCuda vs JavaCPP at first.

Thanks again.

jcuda commented 6 years ago

By multithreading you mean CPU multithreading?

Yes. CUDA offers streams and some synchronization methods that are basically orchestrated from client side. (This may involve stream callbacks, which only have been introduced in JCuda recently, an example is at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/driver/samples/JCudaDriverStreamCallbacks.java )

As for the other "benchmarks": Some simple matrix multiplication could be one that creates a real workload. Others might be more artificial, in order to more easily tune the possible parameters. Just a rough example: One could create a kernel that just operates on a set of vector elements. Then one could create a vector with 1 million entries and try different configurations - namely, copying X elements and launching a kernel with grid size X (1,000,000/X times). This would mean

(the kernel itself could then also be "trivial", or create a real workload by throwing in some useless sin(cos(tan(sin(cos(tan(x))))) computations...)

Again, this is just a vague idea.
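The parameter sweep described above could be sketched like this (hypothetical names; the real version would do the memory copy and kernel launch inside the loop):

```java
public class ChunkSweep {
    static final int TOTAL = 1_000_000;  // total vector elements to process

    // For a chunk of X elements, the kernel must be launched TOTAL/X times.
    static int launchesFor(int chunkSize) {
        return TOTAL / chunkSize;
    }

    public static void main(String[] args) {
        for (int chunk : new int[]{1_000, 10_000, 100_000, 1_000_000}) {
            // per configuration: copy `chunk` elements, launch a kernel with a
            // grid sized for `chunk`, and repeat launchesFor(chunk) times
            System.out.println(chunk + " elems/launch x "
                    + launchesFor(chunk) + " launches");
        }
    }
}
```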

saudet commented 6 years ago

FWIW, being able to compile CUDA kernels in Java is something we can do easily with JavaCPP as well. To get a prettier interface, we only need to finish what @cypof has started in https://github.com/bytedeco/javacpp/pull/138.

blueberry commented 6 years ago

@archenroot @jcuda May I add that the actual computation time of the GPU kernels is not that important for the benchmarks. What we need to measure here is an overhead over plain C/C++ cuda driver calls.

So, let's say that enqueueing the "dummy" kernel costs X time. A Java wrapper needs k * X time. We are interested in knowing k1 (JCuda) and k2 (JavaCPP CUDA), i.e. `k1*X/X`, `k2*X/X`, and/or `(k1*X)/(k2*X)`.

In my opinion, (k1*X)/(k2*X) is the easiest of those to measure.
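A plain-Java sketch of that measurement (the native calls are replaced by hypothetical stand-ins here; in the real benchmark they would be the C enqueue call and its JCuda/JavaCPP wrappers):

```java
public class OverheadSketch {
    static long counter = 0;

    // Hypothetical stand-ins: a "native" enqueue costing X,
    // and a wrapper costing roughly k * X.
    static void nativeEnqueue() { counter++; }
    static void wrappedEnqueue() { counter++; counter++; }

    // Average nanoseconds per call over n iterations.
    static double timePerCall(Runnable r, int n) {
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) r.run();
        return (System.nanoTime() - t0) / (double) n;
    }

    public static void main(String[] args) {
        int warmup = 1_000_000, runs = 10_000_000;
        timePerCall(OverheadSketch::nativeEnqueue, warmup);   // JIT warmup
        timePerCall(OverheadSketch::wrappedEnqueue, warmup);
        double x  = timePerCall(OverheadSketch::nativeEnqueue, runs);
        double kx = timePerCall(OverheadSketch::wrappedEnqueue, runs);
        System.out.printf("k = %.2f%n", kx / x);  // the overhead factor
    }
}
```

Comparing the two wrappers then reduces to comparing their measured k values against the same baseline X.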

jcuda commented 6 years ago

Compiling CUDA kernels at runtime already is possible with the NVRTC (a runtime compiler). An example is in https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcVectorAdd.java . (Of course one could add some convenience layer around this. But regarding the performance, the compilation of kernels is not relevant in most use cases). I'll have a look at the linked PR, though.

saudet commented 6 years ago

@jcuda Oh, interesting. It's nice to be able to do this with C++ in general and not only CUDA though.

jcuda commented 6 years ago

In fact, the other sample at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcLoweredNames.java shows that this also supports "true" C++, with namespace, templates etc.

(The sample does not really "do" anything, it only shows how the mangled names may be accessed afterwards).

The NVRTC was introduced only recently, and before it was introduced, one problem indeed was the lack of proper C++ support for kernels in JCuda: It was possible to compile kernels that contained templates by using the offline CUDA compiler (which is backed by a C++ compiler like that of Visual Studio). The result was a PTX file with one function for each template instance. But of course, with oddly mangled names that had to be accessed directly via strings from Java. With the NVRTC, this problem is at least alleviated.

saudet commented 6 years ago

But it doesn't help for C++ code running on the host, right? So, if I understand correctly, NVRTC doesn't help for something like Thrust: https://github.com/bytedeco/javacpp/wiki/Interface-Thrust-and-CUDA

jcuda commented 6 years ago

That's right. And the question has been asked occasionally, aiming at something like "JThrust". But I think that the API of Thrust (which on some level is rather template-heavy) does not map sooo well to Java. I think that a library with functionality similar to that of Thrust, but designed in a more Java-idiomatic way, would be preferable.

(A while ago I considered at least creating some bindings for https://nvlabs.github.io/cub/ , as asked for in https://github.com/jcuda/jcuda-main/issues/11 , but I'm hesitant to commit to another project - I'm running out of spare time....)

saudet commented 6 years ago

@jcuda @archenroot @blueberry FYI, wrapper overhead might become more important since kernel launch overhead has apparently been dramatically reduced with CUDA 9.1:

  • Launch kernels up to 12x faster with new core optimizations

https://developer.nvidia.com/cuda-toolkit/whatsnew

jcuda commented 6 years ago

They don't give any details/baseline of what they compared. A dedicated benchmark or comparison with CUDA 9.0 and 9.1 might be worthwhile. (I haven't updated to 9.1 yet - currently, the Maven release of 9.0 is on its way...)

@archenroot Any updates on the benchmark repo?

saudet commented 6 years ago

In the meantime, I've released presets for CUDA 9.1 :) http://search.maven.org/#search%7Cga%7C1%7Cbytedeco%20cuda

archenroot commented 6 years ago

@jcuda - I am unfortunately busy with other projects at the moment and preparing to relocate with my family in the next 2 months, so I don't see benchmark progress as feasible from my side in the next 2-3 months...

@saudet - you are deadly warrior :-) thx for update.

saudet commented 6 years ago

FYI, commit https://github.com/bytedeco/javacpp-presets/commit/916b06032ecd00970c1bd8d2c2ac6bc7ac05e665 reduces the JNI wrapper overhead even further.

agibsonccc commented 5 years ago

@archenroot I'd love an update on some of your thoughts here since we're towards the end of the year. We'll be putting more resources in to javacpp and I want to understand what you think some of the gaps might be.

jcuda commented 5 years ago

Indeed, it has been a long time and https://github.com/bytedeco/benchmarks is still empty. I'll try to increase the priority of creating a set of benchmarks, but this should happen in close collaboration with @archenroot to make sure that the benchmarks can "co-evolve" for JavaCPP and JCuda.

tahaemara commented 5 years ago

I created a small repo benchmarking different libs and native implementations (including JCuda and the JavaCPP presets for CUDA) of matrix multiplication of size 2000x2000. The results are here: https://github.com/tahaemara/multi-threaded-matrix-multiplication

saudet commented 5 years ago

@tahaemara You'll need to run the same function at least a few times for this to mean anything, ideally using something like JMH: https://openjdk.java.net/projects/code-tools/jmh/

jcuda commented 5 years ago

Broadly speaking, @saudet is right with the comment. The benchmark in its current form does not tell you too much about the actual performance (although it's a start, that's for sure). The problem of JIT warmup is a general one when benchmarking Java applications. But we're not only in the Java world here. The fact that many operations in CUDA are asynchronous also has to be taken into account. Beyond that, a more detailed analysis would/should/could include

EDIT:

A side note: JMH may be good for plain Java applications, but I'm not entirely sure whether it can sensibly be used to benchmark (asynchronous) JNI functions. (It might be, I'm not so deeply familiar with it. But one has to keep in mind that it's primarily aiming at the JVM...)

saudet commented 5 years ago

JMH recommends sending output to "blackholes", for example: https://hg.openjdk.java.net/code-tools/jmh/file/default/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_09_Blackholes.java So asynchronous operations are pretty much synchronized. It should give usable results.
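The blackhole idea can be illustrated in plain Java (a hypothetical sketch, not JMH itself, which handles consumption and warmup far more rigorously):

```java
public class BlackholeSketch {
    static volatile long blackhole;  // a volatile write the JIT cannot optimize away

    // The "work" being measured; if its result never escapes,
    // the JIT may eliminate the whole loop as dead code.
    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) acc += Math.round(1000 * Math.sin(i));
        return acc;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long result = work(1_000_000);
        long elapsed = System.nanoTime() - t0;
        blackhole = result;  // "blackhole" the result, as JMH's Blackhole.consume does
        System.out.println(elapsed + " ns, checksum " + blackhole);
    }
}
```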

archenroot commented 5 years ago

I am sorry, I can't help here at the moment - I must clone myself first to support multitasking :-) - but I'm happy to read about progress/results.

breandan commented 4 years ago

In particular, it would be great to have OpenCL bindings due to the situation with CUDA on Mac.

saudet commented 4 years ago

Apple has also deprecated OpenCL and all development has stopped, so that's not going to help.

jcuda commented 4 years ago

Ouch. To have this here as well, quoting the statement from the release notes:

CUDA 10.2 (Toolkit and NVIDIA driver) is the last release to support macOS for developing and running CUDA applications. Support for macOS will not be available starting with the next release of CUDA.

@saudet To be honest: It has been quiet around OpenCL on all fronts recently.

The reasons are obvious for NVIDIA: They want to push CUDA and sell their cards. AMD has been a driving force to some extent, but in general, support for OpenCL 2.0 is rare and limited. I had some hope for OpenCL (due to libraries like CLBlast, and conferences like https://www.iwocl.org/ ). Some people might also be on the fence between OpenCL and Vulkan Compute (although the application cases should be clear).

From doing a quick search, it appears that Apple wants to focus on Metal. Skimming over the presets of JavaCPP, I haven't found presets for Metal - did I overlook them? Did you consider creating presets for Metal?

In any case, it seems like GPU/Compute support for Apple will become a very narrow field...

blueberry commented 4 years ago

I agree on the low hope for GPU compute on macOS, but OpenCL serves me rather well on Linux for both CPU and GPU compute. I think that OpenCL still has value on Linux and Windows, even if in the shadow of CUDA.


saudet commented 4 years ago

Intel oneAPI should be working with any GPUs on Linux and Windows, so that looks like a promising candidate. We'll see if they add some kind of support for Mac, but it should be easy for them to do since Apple uses their GPUs for graphics anyway.

In any case, I'm totally fine with having presets for OpenCL, Metal, and Vulkan. It's just not a high priority for me or the community in general. You can't expect me to do everything by myself alone. That's not realistic. Though if I get pull requests for those or anything else for that matter, I'll be more than happy to merge them!

saudet commented 3 years ago

I've added presets for OpenCL 3.0 here: https://github.com/bytedeco/javacpp-presets/tree/master/opencl The API is pretty small, and the specs are pretty stable these days, so it should be easy to maintain. Vendor extensions are not currently mapped, that would require a little bit more work to get working, but please let me know if any of you need anything and I'll do it.

Same thing for CUDA. If there is anything missing from the presets for CUDA preventing you from using them, please let me know!

blueberry commented 3 years ago

Thank you. I'll check it out (although I can't seriously look into this for at least several months due to a pile of other stuff). Basically, I'll need the typical OpenCL 2.0 host API calls. There are not many of them that are not supported in 2.0 (most of the 2.0 stuff that I used is device-side and thus independent of the host API), but I need them. If I remember well, there were only a few key functions that were different: something related to command queue creation and, if I remember well, a few more variants of the function for enqueueing the ND-range. Unfortunately, after a detailed search, I can't find a web document that lists these differences in an easy way (even though I remember that such a document existed 5 years ago :). I'll have to try to port the actual code to be able to tell you exactly what's missing :(

Why I think OpenCL 2.0 feature support is important: Intel supports these features for their CPUs and GPUs, so they might be good hook points for accelerating Java numeric code on the CPU with AVX. AMD also supports these features, although for AMD, HIP support would be even better. I guess that the existing Bytedeco CUDA bindings might be a good springboard, since the API should be mostly identical. If I knew Bytedeco I would work on this, but there is always that damn issue of needing some time to actually learn Bytedeco's system properly. Is there a plan to support HIP, btw? I hope I'll get to work on this one day, but my current stumbling point is that I have infrastructure that works well, lack the time to fill it with more features, and would need to learn Bytedeco's generator machinery with my limited knowledge of C++. I'd love to...

saudet commented 3 years ago

All the API for OpenCL 2.0 should be there, yes. If there is anything missing I'll fix it when we find it! CLBlast doesn't look too hard to support either, but first things first, let me know if there is anything missing from the presets for OpenCL itself.

HIP and friends, well, there's a whole lot of minor APIs like that. I don't have the time either to do everything by myself! I was hoping oneAPI would take care of abstracting everything, but it's obviously not going to happen. :( Something's bound to show up at some point though. I'd wait and see what happens over the next few months, and if there's still nothing available... It feels to me that something like TVM might very well start to become useful as a general computational framework:

It already has backends for CUDA, OpenCL, Vulkan, Metal, ROCm, DSPs, FPGAs, etc and it's working pretty well, even from Java:

jcuda commented 3 years ago

Since this might be relevant for this issue thread: I'll probably have to abandon JCuda at some point. NVIDIA is filling it with crazy structures and untestable functions. Although I have some code generation infrastructure, there is still some manual work involved. The recent additions to CUDA (particularly the "graph-execution" ones mentioned in a related rant) cannot sensibly be mapped to Java unless one uses the approach of saying "here's some memory - I don't know what it means, but you can access it in the way that is defined in some header file" (aka offering it as a Pointer with some automatically generated functions for accessing the underlying data). Any approach that would require even the slightest understanding of what these structures are or how they should be used is doomed to fail, because the effort for understanding them (at the pace in which they are introduced) is prohibitively large. (At least, I cannot do this any more, in my spare time).

I don't know how large the actual user base of the CUDA presets is. But the VectorAddDrv sample looks clean (except for that PTX string, but that could probably be replaced with an nvRTC call). The CUDA runtime binaries are provided in Maven Central by JavaCPP (which I could not do, for a variety of reasons). So it's certainly a viable (and maybe more sustainable) approach for using CUDA from Java.

saudet commented 3 years ago

@jcuda The central statistics for the CUDA presets look like this (numbers for December aren't in yet, it seems):

[download statistics chart]

(I don't know what's been happening since August, but it looks like something is happening there. 136,593 downloads for November are from the same IP... Maybe some rogue CI server gone wild somewhere.)

In any case, my goal with JavaCPP was never to provide clean APIs for end users, but to provide developers like you with the tools necessary to work on high-level idiomatic APIs. The kind of tools that nearly all Python developers take for granted, but for some reason most Java developers, even those at Oracle, prefer to write JNI manually, such as with the work that @Craigacp has recently been doing for ONNX Runtime. Another case in point, Panama has officially dropped any intentions of offering something like JavaCPP as part of OpenJDK, see http://cr.openjdk.java.net/~mcimadamore/panama/jextract_distilled.html. What they are saying essentially is that since they haven't been able to come up with something that's perfect, that they can confidently support for the next century or so (I'm exaggerating here, but that's not far from the truth), they will leave this dirty work to others like myself and yourself! :) So, please do consider rebasing JCuda and JOCL on JavaCPP. People who really wish to use the crappy parts of the CUDA API will be able to, while you can concentrate on offering some subset of it that makes sense to most Java users. TensorFlow has done it and they even got a speed boost over manually written JNI, see https://github.com/tensorflow/java/pull/18#issuecomment-579600568. MXNet has also dropped their manually written JNI too and may choose to continue either with (slow) JNA or (faster) JavaCPP, see https://github.com/apache/incubator-mxnet/issues/17783.

In any case, if you still feel strongly against using a tool like JavaCPP, please let me know why! The engineers at NVIDIA certainly haven't been very clear about why they consider tools like Cython, pybind11, setuptools, and pip to be adequate for Python, but not for Java where for some reason everything has to be redone manually with JNI for each new project, see https://github.com/rapidsai/cudf/pull/1995#issuecomment-504342459. /cc @razajafri

jcuda commented 3 years ago

So, what's happening since August...?

[JCuda download statistics chart]

Maybe people (or at least, one or few "large" users) are moving from JCuda to JavaCPP...


In any case, my goal with JavaCPP was never to provide clean APIs for end users, but to provide developers like you with the tools necessary to work on high-level idiomatic APIs.

Originally, my goal of JCuda was also to address two layers:

  1. The 1:1 low-level JNI bindings. Just offering what is there, exactly as it is, regardless of whether it makes sense for Java or not. (This includes obvious things, like process(int *array, int length) that should be process(int array[]) in Java, but also many others)
  2. A somewhat object-oriented, idiomatic, easier-to-use API on top of that

I didn't really tackle the latter. It would be easy to offer some abstraction layer that covers 99% of all use cases (copy memory, run kernel - that's it). But designing, maintaining and extending this properly could be a full-time job.
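The two layers could be sketched as follows (hypothetical names, with pure Java standing in for the actual JNI bindings):

```java
public class Layers {
    // Layer 1: a 1:1 low-level binding mirroring the C signature
    // process(int *array, int length).
    static void processLowLevel(int[] array, int length) {
        for (int i = 0; i < length; i++) array[i] *= 2;
    }

    // Layer 2: the idiomatic wrapper - in Java, the length is
    // implied by the array itself.
    static void process(int[] array) {
        processLowLevel(array, array.length);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        process(data);
        System.out.println(java.util.Arrays.toString(data)); // prints [2, 4, 6]
    }
}
```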

The direct JNI bindings had been manageable... until recently. I have some parsing- and code generation infrastructure (which, in turn, is far away from being publishable). But the general approach of memory/Pointer handling hit some limits with the recent CUDA API extensions.


I talked with some of the Panama guys a while ago. Part of this discussion was also about ~"the right level of abstraction". I'm generally advocating for defining clear, narrow tasks. Creating a tool that does one thing, and does it right. Or as indicated by the two steps mentioned above: Defining a powerful (versatile), stable (!) API, and build the convenience layer based on that.

I didn't manage to follow the discussion on the Panama mailing list in all detail. But I can roughly imagine the difficulties that come with designing something that is supposed to be used for literally everything (i.e. each and every C++ library that somebody might write), and doing this in a form that is stable and reliable.

(And by the way: I highly appreciate the fact that Oracle puts much emphasis on long-term stability and support. Today, I can take a Java file that was written for a 32bit Linux with Java 1.2 in 1999, and drag-and-drop it into my IDE on Win10 with Java 8, and it just works. Period. No updates. No incompatibility. No hassle. No problems whatsoever. Maybe one only learns to appreciate that after being confronted with the daunting task of updating some crappy JS "web-application" from Angular 4.0.1.23 to 4.0.1.23b and noticing that this may imply a re-write. Stability and reliability are important)

I only occasionally read a few mails from the Panama mailing list, and noticed that the discussion is sometimes ... *ehrm*... a bit heated ;-) and this point seems to be very controversial. But I cannot say anything technically profound here unless I invest some time to update and get an overview of the latest state. So ... the following does not make sense (I know that), and may sound stupid, but to roughly convey my line of thought: Could it be that, one day, Panama and JavaCPP work together? E.g. that Panama can generate JavaCPP presets, or JavaCPP presets can be used in Panama? I think that one tool addressing a certain layer, or having a narrower focus than another, does not mean that the tools cannot complement each other...


An aside:

People who really wish to use the crappy parts of the CUDA API will be able to, while you can concentrate on offering some subset of it that makes sense to most Java users.

I'd really like to do that, for some parts of the CUDA API. It lends itself to an Object-Oriented layer quite naturally.

Kernel kernel = Platform.compile("kernel.cu");    // compile a kernel from source
Device device = Platforms.get(0).getSomeDevice(); // pick a device on some platform
Memory input = device.receive(someArray);         // copy host data to the device
Memory output = device.allocate(n);               // allocate device memory for the result
device.execute(kernel, input, output);            // launch the kernel
...

And even more so for the new "Graph Execution" part of the API that my rant was about (I'm a fan of flow-based programming - that's why I created https://github.com/javagl/Flow , and having "CUDA modules" there would be neat...). But the point is: Nobody wants to use these parts of the CUDA API. People think that they have to use it, for profit, and will use it. They will hate it, but they will use it. And NVIDIA knows that, so they obviously don't give the slightest ... ... care... about many principles of API design.


So, please do consider rebasing JCuda and JOCL on JavaCPP. In any case, if you still feel strongly against using a tool like JavaCPP, please let me know why!

I don't feel strongly against using a tool like JavaCPP, and as already mentioned elsewhere: if JavaCPP had been available 10 years ago, I probably wouldn't have spent countless hours on JCuda (including the parsing and code generation infrastructure). I have to admit that I haven't set up the actual JavaCPP toolchain for the actual creation of code, because I'd have to allocate some time for https://github.com/bytedeco/javacpp-presets/wiki/Building-on-Windows , but it would certainly be (or have been) less effort in the long run...

Regarding rebasing JCuda on JavaCPP: I think we already talked about that, quickly, in the forum. It might be possible to do that to some extent. But I have some doubts. Very roughly speaking:

The last one refers to one point that I'm not sure about in JavaCPP. To my understanding, when creating an IntPointer from an int[] array, like https://github.com/bytedeco/sample-projects/blob/master/cuda-vector-add-driverapi/src/main/java/VectorAddDrv.java#L44 , the memory will also (immediately) be allocated and filled on the native side. If this is true, then imagine code like this:

int[] array = new int[100000000];      // 100 million ints - ~400 MB
IntPointer a0 = new IntPointer(array); // This will allocate and copy 400 MB...
IntPointer a1 = new IntPointer(array); // This will allocate and copy 400 MB...
IntPointer a2 = new IntPointer(array); // This will allocate and copy 400 MB...
...

In JCuda, I deliberately tried to allow a "Pointer to an int[] array" as a "shallow object", meaning that it does not do any copies or allocations. One could call this a "more natural" integration of Java arrays (despite all the difficulties that come along with that - garbage collection, relocation...). If the creation of an IntPointer implied an allocation+copy, then one would have to be very careful to avoid patterns like the one above. (And still, even if only one copy is created, it may get in the way of people who deal with ""Big Data®""...). It could probably still be handled in a thin translation layer, but it may require some care to do it right.
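The copy-vs-reference distinction above can be sketched in plain Java. Note that the class names here are hypothetical, purely for illustration; a real IntPointer copies into native memory, not into another Java array, and JCuda's shallow pointers come with the garbage-collection and relocation caveats mentioned:

```java
import java.util.Arrays;

// Hypothetical sketch of the two pointer semantics discussed above:
// a "deep" pointer copies the array on construction (as an IntPointer
// does, into native memory), while a "shallow" pointer only keeps a
// reference to the Java array (as a JCuda-style Pointer does).
final class DeepIntPointer {
    final int[] copy; // stands in for natively allocated memory
    DeepIntPointer(int[] array) {
        this.copy = Arrays.copyOf(array, array.length); // allocate + copy
    }
}

final class ShallowIntPointer {
    final int[] backing; // just a reference, no allocation, no copy
    ShallowIntPointer(int[] array) {
        this.backing = array;
    }
}

public class PointerSemantics {
    public static void main(String[] args) {
        int[] data = new int[4];
        DeepIntPointer deep = new DeepIntPointer(data);
        ShallowIntPointer shallow = new ShallowIntPointer(data);
        data[0] = 42;
        System.out.println(deep.copy[0]);       // prints 0 - the copy is stale
        System.out.println(shallow.backing[0]); // prints 42 - same storage
    }
}
```

The shallow variant is what makes patterns like the triple-`IntPointer` example above harmless in JCuda: three wrappers around one 400 MB array cost next to nothing, whereas three deep copies cost 1.2 GB.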

mcimadamore commented 3 years ago

I only occasionally read a few mails from the Panama mailing list, and noticed that the discussion is sometimes ... *ehrm*... a bit heated ;-) and this point seems to be very controversial. But I cannot say anything technically profound here, unless I invest some time to catch up and get an overview of the latest state. So ... the following does not make sense (I know that), and may sound stupid, but to roughly convey my line of thought: Could it be that, one day, Panama and JavaCPP work together? E.g. that Panama can generate JavaCPP presets, or JavaCPP presets can be used in Panama? I think that one tool addressing a certain layer, or having a narrower focus than another, does not mean that the tools cannot complement each other...

Hi, I'm Maurizio and I work on Panama - I think what you suggest is not at all stupid/naive. The new Panama APIs (memory access + foreign linker) provide foundational layers to allow low-level memory access and foreign calls. This is typically enough to bypass what currently needs to be done in JNI/Unsafe - meaning that, at least for interfacing with plain C libraries, no JNI glue code/shared libraries should be required. It is totally feasible, at least on paper, to tweak JavaCPP to emit Panama-oriented bindings instead of JNI-oriented ones (even as an optional mode). While this hasn't happened yet, I don't think there's a fundamental reason as to why it cannot happen. I know of some frameworks (Netty and Lucene to name a few) who have started experimenting a bit with the Panama API, to replace their current usages of JNI/Unsafe, so it is possible. Of course, since we're still at an incubating stage, there might be some hiccups (e.g. some API points might need tweaking, and/or performance numbers might not be there in all cases) - but we're generally trying to improve things and managed to do so over the last year.
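To give a concrete taste of what "no JNI glue code" means in practice, here is a minimal sketch of calling the C library's strlen directly from Java. It uses the since-finalized java.lang.foreign API (JDK 22+), not the incubating jdk.incubator.foreign API that existed at the time of this comment, so the exact class names differ from what was available then:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;

// Sketch: call the C standard library's strlen() without any JNI glue
// code or compiled shared library, using the finalized Foreign Function
// & Memory API (java.lang.foreign, JDK 22+).
public class StrlenDemo {
    static long strlenOf(String s) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Bind strlen(const char*) -> size_t as a Java MethodHandle
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Build a NUL-terminated C string in native memory
            byte[] bytes = (s + "\0").getBytes(StandardCharsets.US_ASCII);
            MemorySegment str = arena.allocate(bytes.length);
            MemorySegment.copy(bytes, 0, str, ValueLayout.JAVA_BYTE, 0, bytes.length);
            return (long) strlen.invokeExact(str);
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(strlenOf("JavaCPP")); // prints 7
    }
}
```

The key point for the JavaCPP discussion: everything here is plain Java source, so a generator could in principle emit this kind of binding instead of JNI stubs, with no native compilation step per platform.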

jcuda commented 3 years ago

@mcimadamore We talked a bit via mail, and I gave jextract a try in https://mail.openjdk.java.net/pipermail/panama-dev/2019-February/004443.html , but it has been quite a while ago, a lot has happened in the meantime, and I'm not really up to date.

(There's something paradoxical about the situation: I spend spare time on JCuda instead of Panama, while the latter could help me spend less time on JCuda ... :-/ )

While this hasn't happened yet, I don't think there's a fundamental reason as to why it cannot happen.

From a birds-eye perspective (and not being deeply familiar with JavaCPP, I don't have another perspective... yet), my thought was that it might eventually be possible to replace the Generator.java with something that emits Panama bindings. The Generator class might benefit from a refactor, though. Right now, the pattern is that code is emitted based on certain conditions:

...
if (!functions.isEmpty() || !virtualFunctions.isEmpty()) {
    /* write lots of code */
}
for (Class c : jclasses) {  
    /* write lots of code */
}
for (Class c : deallocators) { ... }
if (declareEnums)  { ... }

In fact, there are some similarities to my code generation project. I tried to establish "sensible defaults", but still make it possible to plug in CodeWriter instances at each and every level, based on certain conditions ... roughly like this:

// Define how pointer declarations are written
functionDeclarationWriter.getDeclarationWriter().prepend(
    ParameterPredicates.parameterHasType(TypePredicates.isPointer()),
    new WriterForAllPointers());

// Define the code for initializing a certain parameter...
functionDeclarationWriter.getInitNativeWriter().prepend(
    ParameterPredicates.parameterMatches("methodRegEx.*", "parameterName"),
    new SpecialInitializationWriterForThisParameter());

This may be over-engineering, but conceptually, it's an attempt to abstract what's currently done in the Generator. In general, breaking the 4200-LOC monolith Generator into a handful of XyzGenerator classes (and replacing ~2000 of its lines with something like out.print(templateCodeFromFile("adaptersTemplate.c")) ...) could allow dedicated emitters, or "addressing different backends", so to speak.
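To make the "pluggable backend" idea slightly more concrete, here is a minimal, hypothetical sketch. None of these types exist in JavaCPP; it only illustrates how per-concern emitters behind a common interface could replace the nested if/for blocks of a monolithic generator, so that a JNI backend and a Panama backend become interchangeable:

```java
import java.util.List;

// Hypothetical sketch: per-concern emitters behind one interface, so that
// different backends (JNI glue code, Panama bindings) are interchangeable.
interface Emitter {
    boolean applies(Class<?> c);              // does this emitter handle the class?
    void emit(Class<?> c, StringBuilder out); // append generated code for it
}

final class DeallocatorEmitter implements Emitter {
    public boolean applies(Class<?> c) { return true; } // simplified condition
    public void emit(Class<?> c, StringBuilder out) {
        out.append("// deallocator for ").append(c.getSimpleName()).append('\n');
    }
}

final class EnumEmitter implements Emitter {
    public boolean applies(Class<?> c) { return c.isEnum(); }
    public void emit(Class<?> c, StringBuilder out) {
        out.append("// enum constants for ").append(c.getSimpleName()).append('\n');
    }
}

public class Backend {
    // A "backend" is just an ordered list of emitters, instead of one
    // monolithic method with nested if/for blocks.
    static String generate(List<Emitter> emitters, List<Class<?>> classes) {
        StringBuilder out = new StringBuilder();
        for (Class<?> c : classes) {
            for (Emitter e : emitters) {
                if (e.applies(c)) {
                    e.emit(c, out);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(generate(
                List.of(new DeallocatorEmitter(), new EnumEmitter()),
                List.of(String.class)));
    }
}
```

Swapping the emitter list would then be all it takes to target a different binding mechanism, while the traversal of the parsed classes stays shared.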

But again, that's just brainstorming. I know that it's never as easy as it looks on this level...