diku-dk / futhark

:boom::computer::boom: A data-parallel functional programming language
http://futhark-lang.org
ISC License

Should we use CUDA terminology in the compiler? #2062

Closed: athas closed this issue 7 months ago

athas commented 7 months ago

The first GPU backend was for OpenCL, so the compiler internals consistently use OpenCL terminology such as "workgroup" and "local memory". But with the advent of HIP, even AMD seems to prefer CUDA terminology (thread block, shared memory). Should we follow along? Realistically, most people with GPU knowledge will be familiar with the CUDA terms. This is also what @coancea teaches, and he trains a significant fraction of our ~~forced labour~~ student contributors.

sortraev commented 7 months ago

Yes, I think so! It is my experience that most new literature and most people I talk to prefer CUDA terminology.

But we should acknowledge that there is also ambiguity in CUDA terminology, especially since the change isn't going to erase OpenCL terminology from people's minds. As an example, CUDA's notion of "local memory" (or "thread-local memory") has nothing to do with CUDA shared memory or OpenCL local memory, and it can confuse even experienced CUDA/OpenCL programmers if it is not made explicit with e.g. "thread-local global memory". Also, "block" is going to be an overloaded term in some parts of the compiler, and I'm curious whether there may be other overlaps(?)
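To make the distinction concrete, here is a minimal illustrative CUDA sketch (the kernel and variable names are made up, and exactly where the compiler places the per-thread array is ultimately up to it):

```cuda
// Illustrative only: how the same word "local" points at different
// things in CUDA and OpenCL. Assumes a launch with blockDim.x == 256.
__global__ void terminology_demo(float *out) {
    // CUDA *shared* memory == OpenCL *local* memory: one buffer per
    // thread block (OpenCL work-group), visible to all its threads.
    __shared__ float group_buf[256];

    // A per-thread array with dynamic indexing is typically placed in
    // CUDA *local* memory, i.e. thread-private space that physically
    // resides in global memory. Nothing to do with OpenCL local memory.
    float per_thread[256];

    int tid = threadIdx.x;
    group_buf[tid] = (float)tid;
    __syncthreads();  // OpenCL: barrier(CLK_LOCAL_MEM_FENCE)

    for (int i = 0; i < 256; i++)
        per_thread[i] = group_buf[(tid + i) % 256];
    out[blockIdx.x * blockDim.x + tid] = per_thread[tid];
}
```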

athas commented 7 months ago

The term "block" must never be used by itself. It is the most overloaded term in computer science. That goes for the compiler as well as conversation.

Munksgaard commented 7 months ago

Though it saddens me, it does indeed seem to be the way the world is moving.

Oblomov commented 7 months ago

My opinion carries no weight since I'm not directly involved in the project (I just follow its development from the outside), but I have been doing and teaching GPU programming for over a decade now, and I think it would be better, especially for projects with fully heterogeneous parallel-computing support, to stick to OpenCL and SYCL terminology.

A work-item is not an actual thread on most backends, and technically not even in CUDA (or HIP, which is designed to copy it), where "CUDA thread" is a confusing misnomer adopted mostly for marketing purposes rather than out of technical consideration of its meaning.

"Work-group" is also distinctly superior as a term, if nothing else because it avoids the overloaded term "block". (FWIW, "block" is such a poor choice that even NVIDIA doesn't use it internally, preferring CTA, Cooperative Thread Array.)

The divergence in the meaning of "local memory" between OpenCL and CUDA is a bit of a problem, but I can tell you from years of teaching experience that "shared" can be confusing too, especially for people who have minimal knowledge of parallel computing, because the entire global memory is "shared" (in the sense of shared-memory parallelism). I don't know if or how this affects the Futhark internals, but I've personally started using more specific expressions such as "group-local memory", or even the architectural acronym LDS (Local Data Share), for OpenCL local memory/CUDA shared memory, and plainly "register spills" for CUDA local memory.

athas commented 7 months ago

I do agree that the OpenCL terminology is superior in most cases. The main exceptions are probably "NDRange" instead of "grid", and things that OpenCL has no good terminology for ("warp"). I also like OpenCL's symmetry in how everything that exists at the group level is called "local" (local ID, local memory), whereas NVIDIA's terminology is less than optimal, as you point out.
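Concretely (an illustrative kernel, not compiler code), the symmetry looks like this:

```cuda
// Illustrative mapping of CUDA's built-in IDs to OpenCL's uniform
// "local" naming scheme (1D case only).
__global__ void id_demo(int *out, int n) {
    int local_id  = threadIdx.x;                      // OpenCL: get_local_id(0)
    int group_id  = blockIdx.x;                       // OpenCL: get_group_id(0)
    int global_id = group_id * blockDim.x + local_id; // OpenCL: get_global_id(0)
    if (global_id < n)
        out[global_id] = local_id;
}
```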

But after years of experience, I note that CUDA's nomenclature is just so dominant in the community that our insistence on using OpenCL terminology internally simply causes confusion. Or worse: inconsistency.

I think the best argument for sticking to OpenCL terminology is actually that the graphics APIs (like Vulkan) are closer to OpenCL than to CUDA.