Question: Work on different-design-choices version of the C++ bindings?

Hello OpenCL people :-)

First - thank you and the consortium for maintaining this repository along with the C-ish APIs. It's important and useful.

I'm the author of cuda-api-wrappers (the Modern-C++ wrappers for the core CUDA APIs). Recently, I've gravitated towards doing more OpenCL work, and even more recently, OpenCL-and-CUDA together (like this dual-ecosystem single kernel runner)

In my limited experience with the OpenCL C++ bindings, they have several features which are not to my liking - not bugs, but design choices which, once made, are difficult or impossible to unmake and change. Some examples:

Resource-holding classes are not RAII/CADRe - having a more complex state, but easier to construct and copy
Extensive use of macros
A single-header-file rather than many-files approach
Use of error-code return values and output parameters, as opposed to assuming success and relying on the availability of exceptions
Use of concrete container classes as opposed to templating over containers or iterators.
No use of std::optional's

Most of these are different choices than the ones I've made in wrapping CUDA's APIs. Again, to clarify - I'm not listing all of that to disparage, just to illustrate the design space has points which are quite far away from each other.

So, in light of the above - I was wondering if you are aware of initiatives to devise different C++ bindings with significantly different design choices (and thus incompatible with these bindings). Alternatively - whether you are aware of individuals or groups who might be interested in collaboration on such alternative bindings.

@eyalroz Thank you for taking the time to provide feedback. While the OpenCL workgroup has discussed briefly the issue you opened (which I believe we should turn into a discussion) it was decided that we all need to sleep on it a few times before concluding on anything. In the meantime, allow me to tune in with my personal remarks.

I'm not quite sure which kind of resource class you find missing the RAII principle. Taking cl::Buffer as an example, as the most resource-heavy API object, CTORs and DTORs properly handle object reference counts. All API objects are ref counted, and in that sense behave akin to std::shared_ptr of the handled object type. This is the behavior the OpenCL C API exposes and these wrappers don't do anything else (wrt. lifetime management) than add value semantics to ref counting. Resources related to objects are released by the underlying OpenCL runtime. The wrappers cannot really interfere with that.
- If one would want to associate some other form resource lifetime semantics to the wrapper objects, a lightweight runtime needs to be built on top of the OpenCL C API, which these wrappers never intended on doing. Other wrappers may do something similar and I've already expressed my personal opinion on this matter. Adding RAII semantics on top of OpenCL C API which captures something else than the underlying reference count is aking to building an engine on top of OpenCL. It could be valuable, but it's another abstraction on top.
I'm not a big fan of macros either. Macros changing the contents of the headers make it hard to expose the wrappers as a C++ module, because there won't be 1! module to ship, but cl.300.exceptions.il_program etc. These optional features of the wrapper are suboptimal in a sense, but things such as the presence/absence of exceptions are something various consumers of OpenCL are adamant about, either because of the simplicity they provide or simply because the target architectures can't even execute excpetions or aren't forgiving to their presence.
The C++ wrapper is a single file. The base C headers may be unified, but I don't see that happening anytime soon. It would require a a lot of plumbing to change that. Honestly, I don't see this as a burden in any regard.
This (IMHO) is the right approach. I've never seen a comprehensive comparison of real-life, even just mildly complicated program written twice: once with exceptions and once without and show that the runtime of the "happy paths" would differ significantly. Throwing exceptions is cheap, catching them is costly (that's where the dynamic type casting happens) Because OpenCL errors typically can't be handled by programs (others than logging and terminating), this choice sanitizes the happy path to contain minimal logic and no error handling noise.
- Should you actually expect something to fail and react to it in a meaningful way, in those handful of occurences, providing an error code pointer will cause the wrappers to report the error and not throw. In my experience this is extremely rarely needed.
The use of customizable containers used by the API (through macros) and the use of iterators makes it flexible enough to interface cleanly with just about any code. The few cases where this is violated actually hurts perf.
- Take cl::CommandQueue::enqueueAnything where the event wait list is taken as an lvalue reference to an actual cl::vector instance. In perf critical scenarios this forces you to persist your event arrays or create them at the price of a malloc. They can't be stack-allocated arrays or anything else. Forcing the use of a particular container at times causes such limitations.
The C++ wrappers pre-date C++11 even. The current opencl.hpp is a rebrand of cl2.hpp which pre-dates C++14. Propagating std::optional through the wrappers either adds another dimension to the optional macro features to which the wrappers react or be a braking change.

All in all yes, these are decisions that could be revisited, given the age of the wrappers, but supporting the current C++ API for backwards compatibility will most certainly be needed. The breaking change of cl.hpp to cl2.hpp was already annoying enough for users of OpenCL. My guess is that whatever significant enough change is to be made, it will be an additive change on top of whatever exists currently. Overall, most design decisions (IMHO) were the right ones at the time. I too would be interested how different the API would look if one would draw the line at C++17 or C++20 minimum.

I'll be sure to ask around if there are similar initiatives around (Moritz Lehmann's being one example from the linked Twitter encounter) and while I don't have bandwidth to contribute, I'm most certainly interested in alternative, modern encapsulations of OpenCL.

(which I believe we should turn into a discussion)

That can mean one of several things: A discussion here, a discussion over some group communications platform, etc.

it was decided that we all need to sleep on it a few times before concluding on anything.

I'm glad that you're taking it this seriously... if it helps, I can make myself avaialble for (informal) chatting before you do anything as a group.

I'm not quite sure which kind of resource class you find missing the RAII principle. Taking cl::Buffer as an example, as the most resource-heavy API object, CTORs and DTORs properly handle object reference counts.

Ah, but a cl::Buffer can live with a reference count of 0. In fact, it has a default constructor: cl::Buffer my_buffer; works. That means it is not a "constructor allocates, destructor releases" type. And this is true for multiple wrapper classes, like CommandQueue, Context etc. And while you can construct these wrappers while creating the associated OpenCL resource, the fact that you don't have to means that any code that takes any of these wrapper objects cannot assume they actually stand for anything.

things such as the presence/absence of exceptions are something various consumers of OpenCL are adamant about, either because of the simplicity they provide or simply because the target architectures can't even execute exceptions or aren't forgiving to their presence.

It's true that some architectures don't support exceptions; but I wonder how many CPU architectures can manage OpenCL devices on the one hand but not exceptions on the other. Regardless - exceptions are a choice that I've made with my wrappers - as all platforms supporting CUDA also support exceptions; and they do allow for simplifying the API in many respects. My interest in CUDA-OpenCL API commonality or bridging is restricted to those platforms where both execution ecosystems are supported, so I have the luxury of making that choice.

BTW - Even without exceptions, I would probably go for something like std::expected (or an equivalent thereof for earlier C++ versions) rather than returning just a status code.

The C++ wrapper is a single file. The base C headers may be unified, but I don't see that happening anytime soon. It would require a a lot of plumbing to change that. Honestly, I don't see this as a burden in any regard.

That point is merely a matter of coding convenience...

This (IMHO) is the right approach.

Since we both used bullet points rather than numbers, I'm not sure which "this" you referred to :-P

The use of customizable containers used by the API (through macros) and the use of iterators makes it flexible enough to interface cleanly with just about any code. The few cases where this is violated actually hurts perf.

Some containers are not customizable, and are simply std::vector's. Examples: getInfoHelper(), Platform::getDevices(), Platform::getPlatforms(). Allowing for different containers here would not have any effect on performance, since these are called extremely infrequently, maybe just once. Also, customizing things with macros (mostly) prevents reasoning about the types at compile-time, and then there's the question of what happens if one library is compiled with one macro in the headers and another with a different macro.

Take cl::CommandQueue::enqueueAnything where the event wait list is taken as an lvalue reference to an actual cl::vector instance. In perf critical scenarios this forces you to persist your event arrays or create them at the price of a malloc. They can't be stack-allocated arrays or anything else. Forcing the use of a particular container at times causes such limitations.

So, about performance again - this kind of call should not happen in a perf-critical part of one's code. Calling CUDA or OpenCL API's, going to the driver etc. is pretty slow, in relative terms - tens of microseconds in typical cases. If enqueuing performance is somehow an issue for me, I would think something's wrong with my program (or that I need to use CUDA graphs; or both).

All in all yes, these are decisions that could be revisited, given the age of the wrappers, but supporting the current C++ API for backwards compatibility will most certainly be needed.

Certainly, I'm not denying that; that's why I didn't suggest that the API be revamped, and was just wondering about an alternative, developed in parallel.

I too would be interested how different the API would look if one would draw the line at C++17 or C++20 minimum.

You can already go pretty far with C++11, but C++17 or 20 would at least clean up my wrappers a bit more and reduce the need for partially replicating some missing standard library constructs, like a "poor man's span" or "poor man's string view" and such.

I'll be sure to ask around if there are similar initiatives around (Moritz Lehmann's being one example from the linked Twitter encounter)

I noticed Mortiz Lehmann's repository, but I had a similar criticism to the one made by Nagy-Egri Mate: Lehmann's code is idiosyncratic and makes many assumptions about what the wrappers' user wants or needs. It also mixes up functionality which belongs in separate places in the code. The wrappers are also not really lightweight IMHO. And - they make no attempt at covering the entire OpenCL API (not even for a specific version). So, my wrappers are really not at all like Moritz Lehmann's initiative...

I'm most certainly interested in alternative, modern encapsulations of OpenCL.

Then perhaps you would be interested in checking out some of the example programs for the CUDA API. Naturally, there are a bunch of significant differences, but it would give you a sense of the kind of code I want to enable users to write. If you want just one thing to start with, then perhaps this vectorAdd variant.

Another dimension to the discussion about OpenCL & modern C++ is SYCL https://www.khronos.org/sycl/ Using SYCL interoperability mode allows SYCL to act as a C++ binding to the underlying back-end, OpenCL, CUDA, whatever. There was a presentation at the last SYCLcon about it:

Aksel ALPAY, Thomas APPLENCOURT, Gordon BROWN, Ronan KERYELL and Gregory LUECK. « Using interoperability mode in SYCL 2020. » In SYCLcon 2022: International Workshop on SYCL. Association for Computing Machinery, May 2022. doi:10.1145/3529538.3529997. https://www.iwocl.org/wp-content/uploads/39-presentation-iwocl-syclcon-2022-aksel.pdf, https://www.youtube.com/watch?v=XIPhuesdqYE.

@keryell : I have to respectfully disagree. SYCL gives you a simplified higher-level abstraction; it cannot serve as C++ binding or wrappers for OpenCL (or CUDA).

Don't get me wrong - SYCL is quite useful when one cannot spare the effort to get into the nitty-gritty; and for many straightforward use-cases - but what I'm talking about is lower-level than SYCL and would cover (almost?) all of the OpenCL functionality.

I am not saying that SYCL is the way to go but I wanted just to expose some ways to use C++ and OpenCL. The SYCL committee is always happy to get feedback about how to improve the low-level programming of the backend like OpenCL, even if the main goal is higher-level, as you said. In your https://github.com/eyalroz/cuda-api-wrappers/blob/development/examples/modified_cuda_samples/vectorAdd_nvrtc/vectorAdd_nvrtc.cpp example, in SYCL the code could be written in a very similar way, except that the OpenCL C or CUDA driver API kernel compilation is outside of the scope of SYCL today as relying on the back-end for that. At least an API like what you propose could be used here.

@keryell So, combining an API like what I suggest with SYCL... yes, that does sound useful. But that would be jumping a few large steps ahead...

KhronosGroup / OpenCL-CLHPP

Question: Work on different-design-choices version of the C++ bindings? #179