chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Improve CPU/GPU portability #23643

Open e-kayrakli opened 1 year ago

e-kayrakli commented 1 year ago

Improving CPU/GPU portability is something we want to start working on in the near term. This issue is to discuss general issues and ideas.

1. Overview of the challenges

a. Locale Model

As a very practical point to start thinking about this:

on here.gpus[0] {
  foreach i in 1..10 do ...
}

does not work on a system without a GPU or with the flat locale model, because in those scenarios gpus[0] is an out-of-bounds access. The user can work around that in their application, but that's not ideal: we want the same code to run on both GPU and CPU with minimal (ideally no) effort. There are two alternatives that I am aware of:

b. Module Support

Delving deeper into GPU programming in Chapel, users can also be tripped up by the functions in the GPU module, whose behavior without GPUs is undefined (the technical term for "I have no idea what would happen; you'd better not use them"). Most of those functions will eventually be incorporated into existing Chapel features or into new language features we'll need to develop. While that effort is pending, should we make them more portable somehow?
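As a strawman for what "more portable" could mean in the meantime, here is a minimal user-level sketch (not a proposal for the module itself): wrap a GPU routine such as syncThreads so that it compiles away when we aren't targeting GPUs. The name portableSyncThreads is made up for illustration.

use GPU, ChplConfig;

// Hypothetical user-level shim, not part of the GPU module: only call
// syncThreads() when compiling for the gpu locale model, and make it a
// no-op otherwise.  CHPL_LOCALE_MODEL is a param, so the branch folds at
// compile time.  This dodges the "undefined without GPUs" problem rather
// than answering what these routines should mean on a CPU.
proc portableSyncThreads() {
  if CHPL_LOCALE_MODEL == "gpu" then
    syncThreads();
}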

This raises the question of the expected behavior for some of them, e.g., createSharedArray, which doesn't really have a matching concept in CPU-land. That brings us to the next subtopic:

c. Concepts/Terms

While different GPU vendors use different terminology, there is typically a good 1-to-1 correspondence between their terms (thread/work-item, warp/wavefront, etc.). What does each of these terms correspond to in the CPU world? We need to define a conceptual execution model that can map onto different hardware. This part is more abstract than the others, but coming up with a good strategy here first may help with the other parts of this issue.

2. Short-term action items

I think we can do the following in parallel.

a. Consider proc here.getGpuIfAny and iter here.getGpusIfAny

I think we can add these to the GPU module. Locale code is a bit finicky and special, which may create some issues. But I imagine these could take some sort of "policy" argument that determines the behavior when there is no GPU.
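To make the idea concrete, here is a rough sketch of the proc flavor, written as a standalone proc rather than a locale method to keep it short; the policy enum and its members are placeholders made up for illustration, not a settled interface:

// Placeholder policy enum and routine body; only the overall shape matters.
enum gpuPolicy { fallBackToCpu, haltIfNone };

proc getGpuIfAny(policy = gpuPolicy.fallBackToCpu): locale {
  if here.gpus.size > 0 then
    return here.gpus[0];
  if policy == gpuPolicy.haltIfNone then
    halt("no GPU available on locale ", here.id);
  return here;  // fallBackToCpu: run on the current (CPU) locale
}

// The loop from 1a then runs whether or not a GPU is present:
on getGpuIfAny() {
  var A: [1..10] int;
  foreach i in 1..10 do A[i] = 2*i;
}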

b. "Localerators"

We should have general-purpose, standalone iterators that yield locales. An immediate use case would be an iter allGpus that yields GPU locales with the correct affinity to their node; an iter allLocales could yield both CPU and GPU locales. Similar to 2a, these could take a "policy" argument.
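As a sketch of the shape these could take (serial-only, ignoring the affinity and policy questions, so the bodies below are just assumptions):

// Serial-only sketches; real versions would need parallel overloads and a
// way to preserve node affinity when visiting each node's GPUs.
iter allGpus(): locale {
  for loc in Locales do
    for g in loc.gpus do
      yield g;
}

iter allLocales(): locale {
  for loc in Locales {
    yield loc;             // the node-level (CPU) locale itself
    for g in loc.gpus do
      yield g;             // its GPU sublocales, if any
  }
}

// Usage: run a kernel on every GPU in the system (a 0-task coforall if none):
coforall g in allGpus() do on g {
  var A: [1..10] int;
  foreach i in 1..10 do A[i] = i;
}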

c. Improve the current state by proper error messages

The case in 1a would probably fail in strange ways if you don't have GPUs but use the gpu locale model: we'd be compiling with --no-checks, which makes it impossible to catch that OOB access. Should we address that? And if you use the flat locale model with code like on here.gpus, should it simply fail? At compile time? At execution time? Or maybe with warnings?

bradcray commented 10 months ago

Just noting that @psath had a case in this Gitter thread where he ran into issues like this while trying to execute his code under the flat locale model to do some profiling. At the time, it made me wonder whether there were things we could do to make his code "simply work" in the flat locale model without the behavior being too surprising.

At the same time, I worry about the slippery slope here. I.e., I think it's a good thing that Locales[1] results in an OOB error when running on just a single locale, and I wouldn't want it to be treated as a hint that could be ignored, causing execution to silently fall back to Locales[0]. Something as imperative as on here.gpus[0] similarly feels like it could be confusing if the behavior were simply "you don't have GPUs, so I'm going to run this on the CPU." Whereas something like on here.gpuWithFallback(0) would feel more OK, where I'm imagining this to be a method that would return the 0th GPU if there were one and the CPU if not.
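To spell that out, here is a minimal sketch of what I'm imagining, written as a tertiary method purely for illustration (the body is a guess at the behavior, not a designed API):

// Illustration only: return the i'th GPU sublocale if it exists, otherwise
// fall back to the current locale itself.
proc locale.gpuWithFallback(i: int): locale {
  if i >= 0 && i < this.gpus.size then
    return this.gpus[i];
  return this;
}

// so that this runs on the GPU when one exists and on the CPU otherwise:
on here.gpuWithFallback(0) {
  var A: [1..10] int;
  foreach i in 1..10 do A[i] = i;
}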

I'd also be much less concerned about things like having a gpus array in the flat locale model that is simply empty (e.g., var gpus: [1..0] ...;, causing for g in here.gpus to be supported but result in a 0-iteration loop). It's the imperative nature of here.gpus[0] that feels worrisome to me (in terms of its potential for confusion / moving away from our more imperative roots) if we just quietly turn it into something else.

e-kayrakli commented 10 months ago

I agree with all of that.

I noticed that I haven't written down what you can do today in simple scenarios for portability. In case someone ends up needing a quick recipe:

on if here.gpus.size > 0 then here.gpus[0] else here {

}

Works nicely today. A compile-time alternative is:

use ChplConfig;
on if CHPL_LOCALE_MODEL == "gpu" then here.gpus[0] else here {

}