skeqiqevian opened this issue 11 months ago
Hi Kevin,
Here are the two big comments/questions:
You may want to look at something called the Bulk Synchronous Parallel (BSP) model as a potential grounding for what you are proposing.
Here are some further comments jotted down while reading:
- The `THREAD_REGISTER` memory type; don't you merely want a `GPU_DRAM` memory?
- The different loop annotations (`@threads`, `@openmp`, and so on) and how they will affect the backend code generation.
- Handling `sync(..)` as an instruction, where users can define their own backend checks externally. It will also be useful for handling prefetch since it's also a no-op.

Notes from a discussion with Kevin and William:
To preface: this design is definitely not done. However, I wanted to describe the proposal based on discussions with William and get some feedback, since I'll be on vacation next week.
ExoIR representation of CUDA abstractions
CUDA has three primary abstractions that we want to support: memories, parallel hierarchy, and synchronization. We first describe how we represent these in ExoIR. Later, we will describe the necessary safety checks to prevent users from generating bad CUDA code. Ideally, we want to prevent both data races and deadlocks.
Parallel Hierarchy
We will represent parallel block/thread loops as parallel loops with special annotations (e.g. `@THREADS` or `@BLOCK`). In CUDA, these loops are always implicit because users are tasked with writing thread programs. In our programming model, we require users to explicitly write parallel loops. Users may write consecutive parallel loops; e.g., the following correspond to running threads 0-7 all in parallel.
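For instance, a pseudocode sketch in the proposed surface syntax (the trailing `@ THREADS` loop annotation and the `par` ranges are illustrative, not settled syntax):

```python
for i in par(0, 8) @ THREADS:   # all 8 threads run this loop body in parallel
    a[i] = x[i] + y[i]
for j in par(0, 8) @ THREADS:   # a consecutive loop over the same threads 0-7
    b[j] = 2.0 * a[j]
```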
Code generation from this programming model to the CUDA programming model is simple. Each such block/thread loop actually corresponds to an `if` statement which predicates over the specified loop range.
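A hypothetical sketch of the lowering rule, with string-based emission purely for illustration (the function name and interface are made up):

```python
# A thread loop over [lo, hi) lowers to a guard over threadIdx.x
# wrapped around the loop body.
def lower_thread_loop(lo: int, hi: int, body_cuda: str) -> str:
    return f"if ({lo} <= threadIdx.x && threadIdx.x < {hi}) {{\n  {body_cuda}\n}}"

# lower_thread_loop(0, 4, "a[threadIdx.x] = 0.0f;") produces:
#   if (0 <= threadIdx.x && threadIdx.x < 4) {
#     a[threadIdx.x] = 0.0f;
#   }
```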
Excessive generation of if statements

This approach may generate vacuously true `if` statements (e.g. when iterating over all threads), so we should prune those. In particular, unless there is complex block-level synchronization, all of the block-level loops will likely generate vacuously true `if` statements.
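Extending the sketch above, pruning is a simple special case, assuming the kernel is launched with exactly `num_threads` threads per block:

```python
# Drop the guard when the loop covers every thread in the block:
# "0 <= threadIdx.x && threadIdx.x < num_threads" is vacuously true.
def lower_thread_loop_pruned(lo: int, hi: int, num_threads: int, body_cuda: str) -> str:
    if lo == 0 and hi == num_threads:
        return body_cuda
    return f"if ({lo} <= threadIdx.x && threadIdx.x < {hi}) {{\n  {body_cuda}\n}}"
```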
Memory

We will define specialized memory classes for the shared memory (`SHARED_MEMORY`) and thread-local registers (`THREAD_REGISTER`), just as we did for AVX vector registers. These memories require additional checks.
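As a rough sketch of what `SHARED_MEMORY` might look like, modeled on Exo's existing `Memory` subclasses for AVX registers (the method signatures follow Exo's memory API; the CUDA-specific details are assumptions, not a settled design):

```python
from exo.memory import Memory, MemGenError

class SHARED_MEMORY(Memory):
    @classmethod
    def alloc(cls, new_name, prim_type, shape, srcinfo):
        if not shape:
            raise MemGenError(f"{srcinfo}: shared memory must have a static shape")
        dims = "][".join(str(s) for s in shape)
        # Emit a statically sized __shared__ array in the generated CUDA.
        return f"__shared__ {prim_type} {new_name}[{dims}];"

    @classmethod
    def free(cls, new_name, prim_type, shape, srcinfo):
        return ""  # shared memory is reclaimed implicitly at kernel exit
```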
Synchronization

We want to give users control over synchronization. Thus, it is the user's responsibility to properly insert synchronization primitives into their code. At compilation time, we will verify that the user inserted syncs properly before generating CUDA code. In CUDA code, we can perform synchronization over arbitrary predicates; in Exo, however, we will need to restrict ourselves to predicates over index expressions. As a design choice, to avoid reasoning about complicated synchronization patterns, we choose to make synchronizations happen outside of the parallel for loops. Thus, Exo code will place syncs between parallel loops rather than inside them.
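A minimal sketch of that shape, with a hypothetical `sync(lo, hi)` primitive standing between two thread loops (all spellings illustrative):

```python
for i in par(0, 8) @ THREADS:
    a[i] = x[i] + y[i]
sync(0, 8)                      # hypothetical barrier: threads 0-7 must all arrive
for i in par(0, 8) @ THREADS:
    b[i] = a[(i + 7) % 8]       # cross-thread reads of a are safe only after the sync
```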
To avoid deadlocking, we need to check that the specified number of threads arrives at the barrier for an arbitrary predicate. To start out, perhaps we should restrict the predicates to simple ranges, e.g. `[lo, hi]`.

Safety checks
Memory safety
Our proposed programming model doesn't require an entire thread program to be in a single loop over threads, so situations can arise where thread-level registers persist across multiple thread loops.
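A minimal sketch in the proposed syntax (shapes and annotation spellings are illustrative):

```python
reg: f32[8] @ THREAD_REGISTER   # allocated outside the thread loops;
                                # first dimension = number of threads
for i in par(0, 8) @ THREADS:
    reg[i] = x[i]               # each thread writes only its own slot
for i in par(0, 8) @ THREADS:
    y[i] = reg[i] + 1.0         # register values persist across the two thread loops
```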
Therefore, the thread registers may be allocated external to the thread loops. When that happens, the first dimension should be the number of threads. Furthermore, we need to check that each thread only reads from its own registers. We will need to do a similar analysis for shared memory and `@BLOCK` for loops.

Parallel safety
We consider a pair of threads to be non-interfering if each thread's write set is disjoint from the other thread's read and write sets. Race conditions are not possible between non-interfering threads because they write to disjoint memories (they may still read from shared read-only memory). Such "embarrassingly parallel" code does not require any synchronization. Below are some examples of non-interfering parallel threads.
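Two sketches in the proposed syntax (all names illustrative); in each loop, every thread writes a location that no other thread reads or writes:

```python
for i in par(0, 8) @ THREADS:
    a[i] = x[i] * y[i]          # disjoint writes; x and y are only read

for i in par(0, 8) @ THREADS:
    b[i] = a[i] + c[0]          # still non-interfering: c[0] is shared but read-only
```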
Exo's existing analysis for OpenMP parallelism performs this exact check. However, it currently assumes that the parallel loops exist in the outermost scope. We need to extend this approach to nested parallel loops and synchronization.
Proposed analysis
Disclaimer: I don't currently know the specifics of implementing such an analysis. I'll need to talk with Yuka and Gilbert to better understand what they are doing with Abstract Interpretation. But I think this describes, at a high level, the kind of checks we need to perform.
We require users to insert synchronization into their code to break it into sections of non-interference. The analysis needs to verify that, in between synchronizations, threads are non-interfering. To do so, for each thread, we track the memory locations that it can access safely, updating this tracking as we iterate through the program.
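A rough sketch of the bookkeeping as plain Python over abstract location/thread identifiers (the real analysis would operate symbolically on ExoIR index expressions, not concrete values):

```python
# Map each memory location to the thread that currently owns it exclusively.
# Ownership is claimed by a write and released by a synchronization.
exclusive: dict = {}

def on_write(loc, thread):
    owner = exclusive.setdefault(loc, thread)
    if owner != thread:
        raise RuntimeError(f"race: {loc} written by threads {owner} and {thread}")

def on_read(loc, thread):
    owner = exclusive.get(loc)
    if owner is not None and owner != thread:
        raise RuntimeError(f"race: {loc} read by {thread} but owned by {owner}")

def on_sync():
    exclusive.clear()   # after a barrier, prior writes are visible to all threads
```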
Analysis Example
As an example, consider the following program:
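Here is a sketch in the proposed syntax, consistent with the progression described below (all names illustrative):

```python
for i in par(0, 8) @ THREADS:
    a[i] = x[i] + 1.0           # each thread writes its own a[i]
sync(0, 8)                      # hypothetical barrier over threads 0-7
for i in par(0, 8) @ THREADS:
    b[i] = a[(i + 1) % 8]       # a is only read now; each thread writes its own b[i]
```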
The analysis progression would update the accessible memory locations as follows:
- Initially, all memories are accessible by all threads.
- After the first loop, the `a[i]`s are exclusive because they are written to.
- After the sync, all the `a[i]`s are no longer exclusive.
- After the second loop, none of the `a`s are affected because those were read-only memories. However, the `b[i]`s are now exclusive.

Implementation - Not sure yet
The above analysis is doable for simple programs, but I'm less sure how to extend it to more complicated programs with deeper loop nesting. Below is an example of a fairly complicated program (warp specialization) that we would want our analysis to support.
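To give a flavor of the shape involved, a heavily simplified warp-specialization sketch (proposed syntax; `produce` and `consume` are hypothetical instructions, and a realistic version would software-pipeline over a circular buffer):

```python
smem: f32[32] @ SHARED_MEMORY
for i in par(0, 32) @ THREADS:      # producer warp: threads 0-31 fill shared memory
    smem[i] = produce(x, i)
sync(0, 128)                        # producers and consumers meet at the barrier
for i in par(32, 128) @ THREADS:    # consumer warps: threads 32-127 read it
    y[i - 32] = consume(smem[(i - 32) % 32])
```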
Sidenote: Exo currently can't schedule circular-buffer optimizations, which would be necessary for the software pipelining that enables this producer-consumer model.
More examples of ExoIR
Taken from CUDA C++ Programming Guide 7.26.2.