Hi @fommil, I started answering this thread/issue when I responded to your comments in Issue #15; I wanted to follow up with some extra comments here.
Our BLAS (and likewise FFT) API is designed the way it is to unlock the complete potential of the heterogeneous platforms that OpenCL is designed for. In order to do that, our library cannot make any guesses as to where the client's data resides. As such, the library is designed to allow the OpenCL client to completely manage their own data, meaning that the client controls where data lives and when it should be transferred to the host or the device. Our APIs assume nothing, and we are not going to compromise that flexibility in the clMath APIs; it has to be this way for performance.
However, I completely see the value in interfaces that are easier to use: interfaces that allow programmers to prototype functionality more quickly, or that simply give more developers access to the power of heterogeneous computing sitting on everybody's desktops and gaming rigs. Products like AccelerEyes' ArrayFire, or the personal project you link above, have an important role in fleshing out the software ecosystem. I hope to see solutions for everybody in the future, where a developer can understand the trade-offs between ease of use and performance and make the decisions that are right for their own compute needs.
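For reference, that caller-managed contract is visible in the entry points themselves: the matrices are `cl_mem` handles that the caller has already placed on a device, and the caller also supplies the command queues and events used for synchronisation. The DGEMM entry point looks roughly like this:

```c
/* clBLAS DGEMM (roughly as declared in clBLAS.h): A, B and C are cl_mem device
 * buffers, and the caller passes in the command queue(s) and events, so data
 * placement and synchronisation stay entirely under the caller's control. */
clblasStatus clblasDgemm(clblasOrder order,
                         clblasTranspose transA, clblasTranspose transB,
                         size_t M, size_t N, size_t K,
                         cl_double alpha,
                         const cl_mem A, size_t offA, size_t lda,
                         const cl_mem B, size_t offB, size_t ldb,
                         cl_double beta,
                         cl_mem C, size_t offC, size_t ldc,
                         cl_uint numCommandQueues, cl_command_queue *commandQueues,
                         cl_uint numEventsInWaitList, const cl_event *eventWaitList,
                         cl_event *events);
```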
@kknox it's not really just about being "easy to use". This is a compatibility issue: if you create a BLAS implementation, one certainly expects to be able to use the BLAS API to interface with it. What you've created is great, but it's not BLAS. It's BLAS-like.
To take a leaf out of LAPACKE's book (the official C API to LAPACK): it offers routines that allow the user to control memory allocation, and easy-to-use ones that simply create arrays on demand. I believe you should offer something similar: i.e. an API that matches Fortran BLAS exactly, in addition to the one you're currently exposing. Surely there must be a way to set up the GPU device on shared library load, and close it down cleanly on exit (or abnormal exit / segfault!), such that the only overhead is the memory transfer; see the sketch below. From my experiments with cuBLAS (which I believe is closer to the latter API), the memory transfer starts to be negligible, compared to the computational benefits, for arrays of 100,000 elements or more.
(I am aware of, and clearly see, the performance advantage of leaving the arrays in GPU memory. That is clearly another level of optimisation that people may wish to make, but it requires significant source code changes.)
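To make the "set up on shared library load" idea concrete, here is a rough sketch using GCC/Clang constructor and destructor attributes. The `g_ctx`/`g_queue` names are placeholders, error handling is omitted, and a destructor will of course not fire after a segfault, so abnormal exits still need separate handling:

```c
/* Sketch only: one-time OpenCL + clBLAS setup/teardown tied to shared library
 * load/unload (GCC/Clang-specific attributes).  Error handling omitted. */
#include <clBLAS.h>

cl_context       g_ctx;    /* placeholder global, shared with the BLAS wrappers */
cl_command_queue g_queue;  /* placeholder global, shared with the BLAS wrappers */

__attribute__((constructor))
static void blas_gpu_init(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    g_ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    g_queue = clCreateCommandQueue(g_ctx, device, 0, &err);
    clblasSetup();                    /* one-time clBLAS initialisation */
}

__attribute__((destructor))
static void blas_gpu_teardown(void)
{
    clblasTeardown();
    clReleaseCommandQueue(g_queue);
    clReleaseContext(g_ctx);
}
```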
NOTE: I may well have fallen victim to NULL operations with cuBLAS also. When I'm next at my desktop machine, I'm going to run tests at the same time as the performance runs, to ensure that the DGEMM is actually taking place.
Closing old clBLAS issues for the new year
This idea of creating an API to match the BLAS API should be handled in a wrapper around the clBLAS library, or implemented in another library altogether. Managing OpenCL state is complicated, and it was not a design goal of this project.
It is entirely possible to create a project that matches the BLAS API and completely hides the OpenCL details from the end user, and this new wrapper/library could call into clBLAS to implement and manage the OpenCL kernels.
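As a rough feasibility sketch, a Fortran-compatible `dgemm_` built on top of clBLAS could look something like the following. It assumes the `g_ctx`/`g_queue` globals from the load-time sketch above, omits error handling, and simply pays the host-to-device and device-to-host transfer costs inside the call:

```c
/* Sketch only: a Fortran-BLAS-compatible DGEMM that hides the OpenCL details.
 * Assumes g_ctx/g_queue were created at library load.  Error handling omitted. */
#include <clBLAS.h>

extern cl_context       g_ctx;    /* assumed: created at library load */
extern cl_command_queue g_queue;  /* assumed: created at library load */

static clblasTranspose to_trans(char t)
{
    return (t == 'N' || t == 'n') ? clblasNoTrans : clblasTrans;
}

/* Matches the Fortran BLAS calling convention: arguments passed by pointer,
 * column-major storage, trailing underscore on the symbol name. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    size_t sizeA = (size_t)(*lda) * (to_trans(*transa) == clblasNoTrans ? *k : *m) * sizeof(double);
    size_t sizeB = (size_t)(*ldb) * (to_trans(*transb) == clblasNoTrans ? *n : *k) * sizeof(double);
    size_t sizeC = (size_t)(*ldc) * (*n) * sizeof(double);
    cl_int   err;
    cl_event event = NULL;

    /* The caller's arrays live on the host, so the wrapper pays the transfer cost. */
    cl_mem bufA = clCreateBuffer(g_ctx, CL_MEM_READ_ONLY,  sizeA, NULL, &err);
    cl_mem bufB = clCreateBuffer(g_ctx, CL_MEM_READ_ONLY,  sizeB, NULL, &err);
    cl_mem bufC = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE, sizeC, NULL, &err);
    clEnqueueWriteBuffer(g_queue, bufA, CL_TRUE, 0, sizeA, a, 0, NULL, NULL);
    clEnqueueWriteBuffer(g_queue, bufB, CL_TRUE, 0, sizeB, b, 0, NULL, NULL);
    clEnqueueWriteBuffer(g_queue, bufC, CL_TRUE, 0, sizeC, c, 0, NULL, NULL);

    clblasDgemm(clblasColumnMajor, to_trans(*transa), to_trans(*transb),
                *m, *n, *k, *alpha,
                bufA, 0, *lda, bufB, 0, *ldb, *beta, bufC, 0, *ldc,
                1, &g_queue, 0, NULL, &event);
    clWaitForEvents(1, &event);

    /* Copy the result back into the caller's C array. */
    clEnqueueReadBuffer(g_queue, bufC, CL_TRUE, 0, sizeC, c, 0, NULL, NULL);

    clReleaseMemObject(bufA);
    clReleaseMemObject(bufB);
    clReleaseMemObject(bufC);
    clReleaseEvent(event);
}
```

With something like this, the only per-call overhead beyond clBLAS itself is the buffer creation and the transfers, which matches the observation above that the transfer cost becomes negligible for sufficiently large arrays.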
I wrote this up in my own project, as it applies to all of the GPU implementations:
https://github.com/fommil/netlib-java/issues/50
If we could work out the dynamic library loading bit, this sounds like a pretty easy thing to do (but incredibly monotonous!). Possibly worth a feasibility study with DDOT/DGEMM.