KhronosGroup / OpenCL-CLHPP

Khronos OpenCL-CLHPP
Apache License 2.0
368 stars 129 forks source link

Question: Work on different-design-choices version of the C++ bindings? #179

Open eyalroz opened 2 years ago

eyalroz commented 2 years ago

Hello OpenCL people :-)

First - thank you and the consortium for maintaining this repository along with the C-ish APIs. It's important and useful.

I'm the author of cuda-api-wrappers (the Modern-C++ wrappers for the core CUDA APIs). Recently, I've gravitated towards doing more OpenCL work, and even more recently, OpenCL-and-CUDA together (like this dual-ecosystem single kernel runner)

In my limited experience with the OpenCL C++ bindings, they have several features which are not to my liking - not bugs, but design choices which, once made, are difficult or impossible to unmake and change. Some examples:

Most of these are different choices than the ones I've made in wrapping CUDA's APIs. Again, to clarify - I'm not listing all of that to disparage, just to illustrate the design space has points which are quite far away from each other.

So, in light of the above - I was wondering if you are aware of initiatives to devise different C++ bindings with significantly different design choices (and thus incompatible with these bindings). Alternatively - whether you are aware of individuals or groups who might be interested in collaboration on such alternative bindings.

MathiasMagnus commented 2 years ago

@eyalroz Thank you for taking the time to provide feedback. While the OpenCL workgroup has discussed briefly the issue you opened (which I believe we should turn into a discussion) it was decided that we all need to sleep on it a few times before concluding on anything. In the meantime, allow me to tune in with my personal remarks.

All in all yes, these are decisions that could be revisited, given the age of the wrappers, but supporting the current C++ API for backwards compatibility will most certainly be needed. The breaking change of cl.hpp to cl2.hpp was already annoying enough for users of OpenCL. My guess is that whatever significant enough change is to be made, it will be an additive change on top of whatever exists currently. Overall, most design decisions (IMHO) were the right ones at the time. I too would be interested how different the API would look if one would draw the line at C++17 or C++20 minimum.

I'll be sure to ask around if there are similar initiatives around (Moritz Lehmann's being one example from the linked Twitter encounter) and while I don't have bandwidth to contribute, I'm most certainly interested in alternative, modern encapsulations of OpenCL.

eyalroz commented 2 years ago

(which I believe we should turn into a discussion)

That can mean one of several things: A discussion here, a discussion over some group communications platform, etc.

it was decided that we all need to sleep on it a few times before concluding on anything.

I'm glad that you're taking it this seriously... if it helps, I can make myself avaialble for (informal) chatting before you do anything as a group.

  • I'm not quite sure which kind of resource class you find missing the RAII principle. Taking cl::Buffer as an example, as the most resource-heavy API object, CTORs and DTORs properly handle object reference counts.

Ah, but a cl::Buffer can live with a reference count of 0. In fact, it has a default constructor: cl::Buffer my_buffer; works. That means it is not a "constructor allocates, destructor releases" type. And this is true for multiple wrapper classes, like CommandQueue, Context etc. And while you can construct these wrappers while creating the associated OpenCL resource, the fact that you don't have to means that any code that takes any of these wrapper objects cannot assume they actually stand for anything.

things such as the presence/absence of exceptions are something various consumers of OpenCL are adamant about, either because of the simplicity they provide or simply because the target architectures can't even execute exceptions or aren't forgiving to their presence.

It's true that some architectures don't support exceptions; but I wonder how many CPU architectures can manage OpenCL devices on the one hand but not exceptions on the other. Regardless - exceptions are a choice that I've made with my wrappers - as all platforms supporting CUDA also support exceptions; and they do allow for simplifying the API in many respects. My interest in CUDA-OpenCL API commonality or bridging is restricted to those platforms where both execution ecosystems are supported, so I have the luxury of making that choice.

BTW - Even without exceptions, I would probably go for something like std::expected (or an equivalent thereof for earlier C++ versions) rather than returning just a status code.

  • The C++ wrapper is a single file. The base C headers may be unified, but I don't see that happening anytime soon. It would require a a lot of plumbing to change that. Honestly, I don't see this as a burden in any regard.

That point is merely a matter of coding convenience...

  • This (IMHO) is the right approach.

Since we both used bullet points rather than numbers, I'm not sure which "this" you referred to :-P

  • The use of customizable containers used by the API (through macros) and the use of iterators makes it flexible enough to interface cleanly with just about any code. The few cases where this is violated actually hurts perf.

Some containers are not customizable, and are simply std::vector's. Examples: getInfoHelper(), Platform::getDevices(), Platform::getPlatforms(). Allowing for different containers here would not have any effect on performance, since these are called extremely infrequently, maybe just once. Also, customizing things with macros (mostly) prevents reasoning about the types at compile-time, and then there's the question of what happens if one library is compiled with one macro in the headers and another with a different macro.

  • Take cl::CommandQueue::enqueueAnything where the event wait list is taken as an lvalue reference to an actual cl::vector instance. In perf critical scenarios this forces you to persist your event arrays or create them at the price of a malloc. They can't be stack-allocated arrays or anything else. Forcing the use of a particular container at times causes such limitations.

So, about performance again - this kind of call should not happen in a perf-critical part of one's code. Calling CUDA or OpenCL API's, going to the driver etc. is pretty slow, in relative terms - tens of microseconds in typical cases. If enqueuing performance is somehow an issue for me, I would think something's wrong with my program (or that I need to use CUDA graphs; or both).

All in all yes, these are decisions that could be revisited, given the age of the wrappers, but supporting the current C++ API for backwards compatibility will most certainly be needed.

Certainly, I'm not denying that; that's why I didn't suggest that the API be revamped, and was just wondering about an alternative, developed in parallel.

I too would be interested how different the API would look if one would draw the line at C++17 or C++20 minimum.

You can already go pretty far with C++11, but C++17 or 20 would at least clean up my wrappers a bit more and reduce the need for partially replicating some missing standard library constructs, like a "poor man's span" or "poor man's string view" and such.

I'll be sure to ask around if there are similar initiatives around (Moritz Lehmann's being one example from the linked Twitter encounter)

I noticed Mortiz Lehmann's repository, but I had a similar criticism to the one made by Nagy-Egri Mate: Lehmann's code is idiosyncratic and makes many assumptions about what the wrappers' user wants or needs. It also mixes up functionality which belongs in separate places in the code. The wrappers are also not really lightweight IMHO. And - they make no attempt at covering the entire OpenCL API (not even for a specific version). So, my wrappers are really not at all like Moritz Lehmann's initiative...

I'm most certainly interested in alternative, modern encapsulations of OpenCL.

Then perhaps you would be interested in checking out some of the example programs for the CUDA API. Naturally, there are a bunch of significant differences, but it would give you a sense of the kind of code I want to enable users to write. If you want just one thing to start with, then perhaps this vectorAdd variant.

keryell commented 2 years ago

Another dimension to the discussion about OpenCL & modern C++ is SYCL https://www.khronos.org/sycl/ Using SYCL interoperability mode allows SYCL to act as a C++ binding to the underlying back-end, OpenCL, CUDA, whatever. There was a presentation at the last SYCLcon about it:

Aksel ALPAY, Thomas APPLENCOURT, Gordon BROWN, Ronan KERYELL and Gregory LUECK. « Using interoperability mode in SYCL 2020. » In SYCLcon 2022: International Workshop on SYCL. Association for Computing Machinery, May 2022. doi:10.1145/3529538.3529997. https://www.iwocl.org/wp-content/uploads/39-presentation-iwocl-syclcon-2022-aksel.pdf, https://www.youtube.com/watch?v=XIPhuesdqYE.

eyalroz commented 2 years ago

@keryell : I have to respectfully disagree. SYCL gives you a simplified higher-level abstraction; it cannot serve as C++ binding or wrappers for OpenCL (or CUDA).

Don't get me wrong - SYCL is quite useful when one cannot spare the effort to get into the nitty-gritty; and for many straightforward use-cases - but what I'm talking about is lower-level than SYCL and would cover (almost?) all of the OpenCL functionality.

keryell commented 2 years ago

I am not saying that SYCL is the way to go but I wanted just to expose some ways to use C++ and OpenCL. The SYCL committee is always happy to get feedback about how to improve the low-level programming of the backend like OpenCL, even if the main goal is higher-level, as you said. In your https://github.com/eyalroz/cuda-api-wrappers/blob/development/examples/modified_cuda_samples/vectorAdd_nvrtc/vectorAdd_nvrtc.cpp example, in SYCL the code could be written in a very similar way, except that the OpenCL C or CUDA driver API kernel compilation is outside of the scope of SYCL today as relying on the back-end for that. At least an API like what you propose could be used here.

eyalroz commented 2 years ago

@keryell So, combining an API like what I suggest with SYCL... yes, that does sound useful. But that would be jumping a few large steps ahead...