exo-lang / exo

Exocompilation for productive programming of hardware accelerators
https://exo-lang.dev
MIT License

GPU support in Exo #547

Open skeqiqevian opened 11 months ago

skeqiqevian commented 11 months ago

Wanted to preface by saying that this design is definitely not done. However, I wanted to describe the proposal based on discussions with William and get some feedback, since I'll be on vacation next week.  

ExoIR representation of CUDA abstractions

CUDA has three primary abstractions that we want to support: memories, parallel hierarchy, and synchronization. We first describe how we represent these in ExoIR. Later, we will describe the necessary safety checks to prevent users from generating bad CUDA code. Ideally, we want to prevent both data races and deadlocks.

Parallel Hierarchy

We will represent parallel block/thread loops as parallel loops with special annotations (e.g. @THREADS or @BLOCK). In CUDA, these loops are always implicit because users are tasked with writing thread programs. In our programming model, we require users to explicitly write parallel loops. Users may write consecutive parallel loops, e.g. the following correspond to running threads 0-7 all in parallel:

for i in par(0, 4) @ THREADS:
    ...
for i in par(4, 8) @ THREADS:
    ...

Code generation from this programming model to the CUDA programming model is simple. Each such block/thread loop actually corresponds to an if statement which predicates over the specified loop range:

CUDA C++:

```cpp
if (blockIdx.x < 5) {
    if (threadIdx.x < 5) {
        ...
    }
}
```

Exo:

```python
for i in par(0, 5) @ BLOCKS:
    for i in par(0, 5) @ THREADS:
        ...
```

Excessive generation of if statements

This approach may generate vacuously true if statements (e.g. when iterating over all threads), so we should prune those. In particular, unless there is complex block-level synchronization, all of the block-level loops will likely generate vacuously true if statements.
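As a rough sketch of how that pruning could work in the code generator, the guard can be skipped whenever the loop bounds provably cover the whole launch extent. The helper below is hypothetical and not an existing Exo internal:

```python
def emit_thread_guard(lo, hi, launch_extent, idx_var="threadIdx.x"):
    # Returns the CUDA guard for a @THREADS/@BLOCKS loop over [lo, hi),
    # or None when the guard would be vacuously true because the loop
    # covers every thread/block in the launch. Hypothetical helper.
    if lo == 0 and hi == launch_extent:
        return None  # all threads participate; no if statement needed
    conds = []
    if lo > 0:
        conds.append(f"{lo} <= {idx_var}")
    if hi < launch_extent:
        conds.append(f"{idx_var} < {hi}")
    return f"if ({' && '.join(conds)})"
```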

Memory

We will define specialized memory classes for shared memory (SHARED_MEMORY) and thread-local registers (THREAD_REGISTER), just as we did for AVX vector registers. These memories require additional checks, described under Safety checks below.
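For concreteness, here is a minimal sketch of what a shared-memory class might look like, modeled on how Exo's existing Memory subclasses (e.g. AVX2) are written. The import path and hook signatures are assumptions and would need to be checked against the actual Memory interface:

```python
# Sketch only: the import path and classmethod signatures mirror how Exo's
# library memories (e.g. AVX2) are defined, but have not been verified here.
from exo.memory import Memory, MemGenError


class SHARED_MEMORY(Memory):
    @classmethod
    def global_(cls):
        return ""  # no extra #includes needed for __shared__ allocations

    @classmethod
    def alloc(cls, new_name, prim_type, shape, srcinfo):
        if not shape:
            return f"__shared__ {prim_type} {new_name};"
        if not all(s.isdecimal() for s in shape):
            raise MemGenError(f"{srcinfo}: shared memory requires constant shapes")
        return f"__shared__ {prim_type} {new_name}[{' * '.join(shape)}];"

    @classmethod
    def free(cls, new_name, prim_type, shape, srcinfo):
        return ""  # released implicitly when the thread block exits
```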

Synchronization

We want to give users control over synchronization. Thus, it is the user's responsibility to properly insert synchronization primitives into their code. At compilation time, we will verify that the syncs were inserted properly before generating CUDA code. In CUDA, synchronization can be performed over arbitrary predicates (like below); in Exo, however, we will need to restrict ourselves to predicates over index expressions. As a design choice, to avoid reasoning about complicated synchronization patterns, we make synchronizations happen outside of the parallel for loops. Thus, Exo code will look like the following:

CUDA C++:

```cpp
cuda::barrier bar;
bar.init(512);
if (predicate(threadIdx.x)) {
    bar.arrive();
}
```

Exo:

```python
bar: barrier[512]
sync(bar, predicate)
```

To avoid deadlocking, we need to check that exactly the specified number of threads arrives at the barrier, which is hard to verify for an arbitrary predicate. To start out, perhaps we should restrict the predicates to simple ranges, e.g. [lo, hi].
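Under that restriction, the deadlock check reduces to comparing the size of the range against the barrier's expected arrival count. A minimal sketch, assuming the bounds are concrete integers (in general they would be index expressions discharged by the analysis), treating the range as half-open:

```python
def check_sync_range(barrier_threads, lo, hi):
    # Treats the predicate as the half-open thread range [lo, hi),
    # i.e. a predicate of the form `lo <= i < hi`. Hypothetical check.
    arrivals = hi - lo
    if arrivals != barrier_threads:
        raise ValueError(
            f"sync over threads [{lo}, {hi}) yields {arrivals} arrivals, "
            f"but the barrier was initialized with {barrier_threads}"
        )

check_sync_range(512, 0, 512)    # ok: all 512 expected threads arrive
# check_sync_range(512, 0, 256)  # would raise: only 256 threads arrive
```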

Safety checks

Memory safety

Our proposed programming model doesn't require an entire thread program to be in a single loop over threads, so thread-level registers can persist across multiple thread loops, e.g.

reg: i32[8]
for i in par(0, 8) @ THREADS:
    # do something with reg
sync(bar)
for i in par(0, 8) @ THREADS:
    # do something with reg

Therefore, thread registers may be allocated external to the thread loops. When that happens, the first dimension should be the number of threads. Furthermore, we need to check that each thread only accesses its own registers. We will need to do a similar analysis for shared memory and @BLOCK loops.
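For example, if each of the 8 threads needs four scalars, the externally allocated registers and the intended access pattern might look like the following (illustrative Exo-like pseudocode, not tested syntax):

```python
reg: i32[8, 4] @ THREAD_REGISTER   # first dimension = number of threads

for i in par(0, 8) @ THREADS:
    for k in seq(0, 4):
        reg[i, k] = ...            # fine: thread i touches only reg[i, :]

for i in par(0, 8) @ THREADS:
    reg[(i + 1) % 8, 0] = ...      # rejected: thread i writes another thread's register
```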

Parallel safety

We consider a pair of threads to be non-interfering if each thread's write set is disjoint from the other thread's read and write sets. Race conditions are not possible between non-interfering threads because they write to disjoint memory (they may still read from shared read-only memory). Such "embarrassingly parallel" code does not require any synchronization. Below are some examples of non-interfering parallel threads:

for i in par(0, N) @ THREADS:
    b[i] = a[i] + a[i + 1]
for i in par(0, N/2) @ THREADS:
    a[i] = 1.0
for i in par(N/2, N) @ THREADS:
    a[i] = 0.0

Exo's existing analysis for OpenMP parallelism performs this exact check. However, it currently assumes that the parallel loops exist in the outermost scope. We need to extend this approach to nested parallel loops and synchronization.
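As a rough illustration of the underlying check, one can ask a solver whether two distinct loop iterations can ever write the same location. This uses Z3 purely as a stand-in and is not how Exo's effect analysis is actually implemented:

```python
# Illustrative only: check that two distinct iterations of a @THREADS loop
# cannot write the same index, using Z3 as a stand-in solver.
from z3 import Int, Ints, Solver, unsat

def writes_are_disjoint(write_index, lo, hi):
    # write_index maps the loop variable (a Z3 Int) to the written index,
    # e.g. lambda i: i for `a[i] = ...`.
    i, j = Ints("i j")
    s = Solver()
    s.add(lo <= i, i < hi, lo <= j, j < hi, i != j)
    s.add(write_index(i) == write_index(j))  # can two distinct threads collide?
    return s.check() == unsat

N = Int("N")
# `b[i] = a[i] + a[i+1]` over par(0, N): writes to b are disjoint across threads.
print(writes_are_disjoint(lambda i: i, 0, N))   # True
```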

Proposed analysis

Disclaimer: I don't currently know the specifics of implementing such an analysis. I'll need to talk with Yuka and Gilbert to better understand what they are doing with Abstract Interpretation. But I think this describes, at a high level, the kind of checks we need to perform.

We require users to insert synchronization into their code to break it into sections of non-interference. The analysis needs to verify that, in between synchronizations, threads are non-interfering. To do so, for each thread, we track the memory locations that it can access safely, updating those sets as we iterate through the program.
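A minimal sketch of the bookkeeping this implies, using a hypothetical per-thread "exclusive set" representation rather than Exo's actual analysis machinery:

```python
# Hypothetical bookkeeping for the per-thread accessibility analysis.
# Locations a thread has written become exclusive to it; a sync makes
# everything shared (read-only accessible) again.
class ThreadAccessState:
    def __init__(self, num_threads):
        self.exclusive = {t: set() for t in range(num_threads)}

    def record_write(self, thread, location):
        # Writing a location claims it for this thread; if another thread
        # already claims it, we have found a potential race.
        for other, locs in self.exclusive.items():
            if other != thread and location in locs:
                raise ValueError(f"threads {other} and {thread} both write {location}")
        self.exclusive[thread].add(location)

    def record_read(self, thread, location):
        # Reading is only unsafe if some *other* thread holds the location exclusively.
        for other, locs in self.exclusive.items():
            if other != thread and location in locs:
                raise ValueError(f"thread {thread} reads {location} written by {other}")

    def sync(self):
        # After a barrier, all writes are visible; nothing is exclusive anymore.
        for locs in self.exclusive.values():
            locs.clear()
```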

Analysis Example

As an example, consider the following program:

bar: Barrier(threads=4)
for i in par(0, 4):
    a[i] = ...
sync(bar, "i < 4")
for i in par(0, 4):
    b[i] = a[i]
    if i+1 < 4:
        b[i] += a[i+1]

The analysis progression would update the accessible memory locations as follows:

Initially, all memories are accessible by all threads.

thread  memories
0       a[...], b[...]
1       a[...], b[...]
2       a[...], b[...]
3       a[...], b[...]

After the first loop, each a[i] is exclusive to thread i because it was written to.

thread  memories
0       a[0], b[...]
1       a[1], b[...]
2       a[2], b[...]
3       a[3], b[...]

After sync, all the a[i]s are no longer exclusive.

thread  memories
0       a[...], b[...]
1       a[...], b[...]
2       a[...], b[...]
3       a[...], b[...]

After the second loop, none of the a[i]s are affected because they were only read. However, the b[i]s are now exclusive.

thread  memories
0       a[...], b[0]
1       a[...], b[1]
2       a[...], b[2]
3       a[...], b[3]

Implementation - Not sure yet

The above analysis is doable for simple programs, but I'm less sure how to extend it to more complicated programs with deeper loop nesting. Below is an example of a fairly complicated program (warp specialization) that we would want our analysis to support.

CUDA C++:

```cpp
for (int i = 0; i < n_iters; i++) {
    if (threadIdx.x < 32) {
        produce(buf[i%2]);
    }
    __syncthreads();
    if (32 <= threadIdx.x) {
        consume(buf[i%2]);
    }
}
```

Exo:

```python
for i in seq(n_iters):
    for j in par(0, 32) @ THREADS:
        produce(buf[i%2])
    sync(0, n_threads)  # producers done, consumers ready to receive
    for j in par(32, n_threads) @ THREADS:
        consume(buf[i%2])
```

Sidenote: Exo currently can't schedule circular buffer optimizations, which would be necessary for the software pipelining that enables this producer-consumer model.

More examples of ExoIR

Taken from CUDA C++ Programming Guide 7.26.2.

CUDA C++:

```cpp
__global__ void split_arrive_wait(int iteration_count, float *data) {
    using barrier = cuda::barrier<cuda::thread_scope_block>;
    __shared__ barrier bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0) {
        init(&bar, block.size());  // Initialize the barrier with expected arrival count
    }
    block.sync();

    for (int curr_iter = 0; curr_iter < iteration_count; ++curr_iter) {
        /* code before arrive */
        barrier::arrival_token token = bar.arrive();  /* this thread arrives. Arrival does not block a thread */
        compute(data, curr_iter);
        bar.wait(std::move(token));  /* wait for all threads participating in the barrier to complete bar.arrive() */
        /* code after wait */
    }
}
```

Exo:

```python
bar: CUDA_BARRIER @ SHARED_MEMORY
for i in par(0, 1) @ THREADS:
    init(&bar, ...)
sync(0, N)
for i in par(0, N) @ THREADS:
    # code before arrive
sync(0, N)
for i in par(0, N) @ THREADS:
    compute(...)
sync(0, N)
for i in par(0, N) @ THREADS:
    # code after wait
```
gilbo commented 10 months ago

Hi Kevin,

Here are the two big comments/questions:

  1. Is the plan that this will be developed in a fork from Exo for the purposes of a Master's project or is this a proposal for a core Exo feature? Maybe a fork would give you more freedom to quickly start experimenting, and not get you stuck in trying to have a consistent design. For example, you could just create an "Exo-CUDA" language instead of worrying about how to externalize the CUDA programming model.
  2. My original understanding of this project was that the idea was to explore very conservative (w.r.t. concurrency/synchronization) GPU support for Exo. The sketch here seems to be headed towards supporting synchronization primitives, which will be non-trivial because it could invalidate the soundness of all currently existing analyses/program transforms. If it doesn't, it's unclear to me exactly why it doesn't.

You may want to look at something called the Bulk Synchronous Parallel (BSP) model as a potential grounding for what you are proposing.

Here are some further comments jotted down while reading:

yamaguchi1024 commented 10 months ago

Notes from a discussion with Kevin and William: