Block Tiling Using Shared Memory

We will use three simplified examples from n-body, lavaMD and Matrix-Matrix Multiplication benchmarks. The notation of streamSeq might differ from what is curently in Futhark -- I keep around the chunk size and the current index k of streamSeq, i.e., the next chunk starts at array element k and ends at array element k+chunk-1.

I assume the analysis is run on extracted kernels, i.e., which can be seen as a perfect map nest that is going to be executed in parallel on GPU. I also assume that map, redomap, scan, filter, streamMap/Red, etc., appearing inside the kernel have been turned to StreamSeq constructs.

1. How to Recognize the Beneficial Cases for Block Tiling?

In essence block-tiling is beneficial whenever we can detect a streamSeq whose input array is invariant to one of the parallel dimensions of the kernel.

(A) n-body

fun [f32, n] main(f32 eps, [f32, n] bodies) = 
map( fn f32 (f32 body) =>  -- 1-dim Kernel
            streamSeq( fn f32 (int kk, [f32, chunk] bodies_chunk, f32 acc) =>
                            loop (acc) = for k < chunk do
                                acc + (bodies_chunk[i] - body)
                            in acc
                     , bodies, 0f32)
   , bodies)

n-body is the simplest case demonstrating such an access: I assume the kernel is one-dimensional and contains only the outermost map which is fed array bodies. However, array bodies is also fed to the streamSeq that forms the kernel body. Assuming the one-dimensional block size is TILE (e.g., 128, 256, etc.), one case choose chunk = TILE, and just before the loop make the block threads cooperatively write the current TILE-chunk of bodies-chunk to fast (local/shared) memory such that inside the loop all accesses to bodies_chunk are from fast memory (rather than global memory, even if, in this case, it might be efficiently cached).

(B) lavaMD

fun [[f32, m], n] main([[f32, m], n] AA, [[int, p],n] inds_nhbs) = 
map( fn [f32,m] ([f32,m] A, int i) => -- Kernel's first parallel dimension
            map( fn f32 (f32 a) =>         -- Kernel's second parallel dimension 
                    streamSeq( fn f32 (int kk1, [int, chunk1] inds_nhbs, f32 acc1) =>
                            loop (acc1) = for k1 < chunk1 do
                                let ind = inds_nhbs[i, k1] in
                                let invar_AArow = AA[ind] in
                                streamSeq( fn f32 (int kk2, [int, chunk2] invar_AArow_chunk, f32 acc2) =>
                                            loop (acc2) = for k2 < chunk2 do
                                                    acc2 + invar_AArow_chunk[k2] * 2.0f32
                                            in acc2
                                         , invar_AArow, acc1)
                            in acc1
                    , iota(inds_nhbs), 0f32)
               , A)
   , zip(AA, iota(n)))

lavaMD presents a slightly more complex case that basically demonstrates that we cannot rely on simple pattern-matching, but rather we need to do a full pass of invariant-array detection. In the example above invar_AArow is invariant to the second (map) parallel dimension of the kernel. Note that this array is created somewhere deep inside the kernel, but it is fed as input to the second streamSeq and as such can be blocked tiled, i.e., used directly from shared memory.

Note that this requires a 2-D grid, but only a 1-dim block (i.e., the second map is split in one or a number of blocks), and the TILE size can be the dimension of the 1-d block. This would allow to amortize one global-memory access over TILE accesses from fast memory (related to invar_AArow_chunk).

Note however that if the invariant map would have been the first one, block-tiling would have required to interchange the parallel dimensions, i.e., bring it innermost because fast memory is shared only within the same workgroup.

(C) MM-Mult

fun [[f32,q],m] main([[f32,n],m] A, [[f32,q],n] B0) =
  let B = transpose(B0) in
  map(fn [f32,q] ( [f32,n] a ) =>       -- Kernel's first parallel dimension
            map(fn f32 ( [f32,n] b ) =>  -- Kernel's second parallel dimension
                    streamSeq(fn f32 (int kk, [f32,chunk] a_c, [f32,chunk] b_c, int acc) =>
                                  loop (acc) = for k < chunk do
                                          acc + a_c[k]*b_c[k]
                                  in acc
                             , a, b, 0f32 )
               , B) 
     , A )

The third example is the trickier one of matrix-matrix multiplication: here a is invariant to the second map, while b is invariant to the first map, and they are both fed to the streamSeq that makes the body. In order to cache both accesses in shared memory, one needs to create a 2-dimensional workgroup (so that accesses on both a and b can be shared).

How to Implement Tiling?

Once tiling opportunities have been detected, block tiling can be implemented by (i) adding a notion of fast (shared/local) memory to the Kernel, by (ii) inserting a special function call that performs the per-thread copying to fast memory just between the invariant streamSeq and its corresponding loop, and, finally, by (iii) replacing accesses to the invariant array with accesses from fast memory.

streamSeq provides a convenient notation for doing this because the indexing into array chunks starts at 0 and goes until chunk, hence exactly how we need to access fast memory! We use here a CUDA notation -- we will have to extend the kernel with some magic things such as a notion of block/workgroup-size (TILE) and indexing within a workgroup (threadIdx.x), and obviously, a notion of fast memory parameterized over the workgroup size, and a way to copy to fast memory, denoted below with copyFastMem.

(A) n-body

For n-body we assume a 1-dim block of size TILE, and as such i corresponds to blockIdx.x*blockDim.x + threadIdx.x. We choose the chunk of the invariant streamSeq to be TILE. The translated code becomes:

fun [f32, n] main(f32 eps, [f32, n] bodies) = 
map( fn f32 (int i, f32 body) =>  -- 1-dim Kernel; 
                                                  -- 1-dim workgroup/block of size TILE, threadIdx.x in [0,TILE-1]
                                                  -- fast memory: shMem_bodies[TILE]
            streamSeq( fn f32 (int kk, [f32, TILE] bodies_chunk, f32 acc) =>
                            let shMem_bodies = copyFastMem(bodies_chunk[threadIdx.x], threadIdx.x) in
                            loop (acc) = for k < TILE do
                                acc + (shMem_bodies[k] - body)
                            in acc
                     , bodies, 0f32)
   , zip(iota(n), bodies)

Note that with this bit of magic, when we finally translate a streamSeq to a loop we need to obviously extend the indexing to arrays such as bodies_chunk but not to fast-memory arrays, such as shMem_bodies. Note that copyFastMem should actually check that the index of the global-memory array is in range, and if it is not it should obviously not write to shared memory.

(B) lavaMD

We treat lavaMD similarly, but note that while the grid has semantically a 2-D structure, we need only use a 1-dim structure for the block/workgroup (of size TILE), such as one or multiple blocks cover the inner (second) map. This is potentially subject to multi-versioned code since this makes sense only if the second map is large enough! We assume for simplicity that TILE divides m, the size of the second map.

fun [[f32, m], n] main([[f32, m], n] AA, [[int, p],n] inds_nhbs) = 
map( fn [f32,m] ([f32,m] A, int i) => -- Kernel's first parallel dimension
            map( fn f32 (f32 a) =>         -- Kernel's second parallel dimension
                                                        -- 1-dim Block of size TILE, threadIdx.x in [0,TILE-1]
                                                        -- fast memory: shMem_AArow[TILE]
                    streamSeq( fn f32 (int kk1, [int, chunk1] inds_nhbs, f32 acc1) =>
                            loop (acc1) = for k1 < chunk1 do
                                let ind = inds_nhbs[i, k1] in
                                let invar_AArow = AA[ind] in
                                streamSeq( fn f32 (int kk2, [int, TILE] invar_AArow_chunk, f32 acc2) =>
                                            let shMem_AArow = copyFastMem(
                                                                                 invar_AArow_chunk[threadIdx.x], 
                                                                                 threadIdx.x ) in
                                            loop (acc2) = for k2 < TILE do
                                                    acc2 + shMem_AArow[k2] * 2.0f32
                                            in acc2
                                         , invar_AArow, acc1)
                            in acc1
                    , iota(inds_nhbs), 0f32)
               , A)
   , zip(AA, iota(n)))

(C) MM-Mult

This is the tricky case because is requires 2-dim tiling, i.e., need to create a 2-dim block/workspace such that the access to both arrays can be tiled in fast memory. To understand how the indexing is transformed, it is instructive to review the transformation first: Semantically the two maps that give the parallel dimensions of the kernel are strip-mined with a TILE factor and then the outer strip is interchanged inward, yielding the doall structure below, which corresponds to the original map-based program:

DOALL ii = 0, m-1, TILE      -- (ii/TILE) gives the workgroup number on Y dimension
    DOALL jj = 0, q-1, TILE   -- (jj/TILE) gives the workgroup number on X dimension
            DOALL i = ii, min(ii+TILE-1, m-1), 1    -- within workgroup: (i - ii) => threadIdx.y
                DOALL j = jj, min(jj+TILE-1, q-1), 1 -- within workgroup: (j - jj) => threadIdx.x
                        -- kernel code
                        acc = 0.0;
                        DO kk = 0, n-1, TILE
                            shMemA[i-ii, j-jj] = A[i, j - jj + kk];
                            shMemB[j-jj, i-ii] = B[j, i - ii + kk];
                            __sync_barrier_sh_mem();
                           DO k = kk, min(kk+TILE-1, n-1), 1
                               acc += shMemA[i-ii, k - kk] * shMemB[j-jj, k - kk]
                        ENDDO ENDDO
                        C[i,j] = acc
ENDDOALL ENDDOALL ENDDOALL ENDDOALL

Rewriting it as a Futhark kernel this would correspond to:

fun [[f32,q],m] main([[f32,n],m] A, [[f32,q],n] B0) =
  let B = transpose(B0) in
  map(fn [f32,q] (int i, [f32,n] a ) =>        -- Kernel's first parallel dimension
            map(fn f32 ( int j, [f32,n] b ) =>  -- Kernel's second parallel dimension
                                                              -- 2-dim Block of size TILE x TILE, 
                                                              -- threadIdx.x, threadIdx.y in [0,TILE-1]
                                                              -- fast memory: shMemA[TILE,TILE]
                                                              -- fast memory: shMemB[TILE,TILE+1]
                    streamSeq(fn f32 (int kk, [f32,chunk] a_c, [f32,chunk] b_c, int acc) =>
                                  let shMemA = copyFastMem( a_c[threadIdx.x+kk]
                                                                               , threadIdx.y, threadIdx.x ) in
                                  let shMemB = copyFastMem( b_c[threadIdx.y+kk]
                                                                               , threadIdx.x, threadIdx.y ) in
                                  loop (acc) = for k < chunk do
                                          acc += shMemA[threadIdx.y, k] * shMemB[threadIdx.x, k]
                                  in acc
                             , a, b, 0f32 )
               , zip(iota(q),B) ) 
     , zip(iota(m),A) )

Three Important Optimization Observations for Matrix-Matrix Multiplication (2-dim Tiling)

First, note that in the code above the access to global-memory array B in the imperative code and to array b_c in the Futhark-kernel code when copying to fast memory is UNCOALESCED, because its innermost index is threadIdx.y+kk, i.e., coalesced would require the innermost index to end in threadIdx.x! This will kill all the performance gained, i.e., is about 3.8x slower than what it should be. As such I am counting on the transpose to be delayed and to change the indexing into the coalesced one (and if for some reason the transpose is missing then matrix B should be transposed).

Second, note the inner dimension of shMemB is TILE+1 instead of TILE. This is because the (uncoalesced) access to shMemB (both read and write) will result in frequent bank-conflicts if the inner size is a multiple of 16, i.e., TILE. So making it TILE+1 alleviates that to a good extent. Note however that on my NVIDIA card, if you also make the inner dimension of shMemA to be TILE+1 this generates a 2x slowdown because of plummeting occupancy (I assume). Another workaround would be to also semantically transpose shMemB, but if too difficult, the TILE+1 inner dimension should do the trick.

Third, when doing block-tiling one should aim to maximize the workgroup/block size, since one access to global memory is amortized over TILE accesses to fast memory. This might be less important for one-dimensional tiling since the gain from 256 to 1024 amortization is probably small (diminishing returns). However for two-dimensional tiling it seems to make a difference: the matrix-matrix multiplication with TILE=32 is about 1.2x times faster than with TILE=16.

Thoughts on How to Achieve (Simplified) Parboil-like Tiling for Matrix-Matrix Multiplication

This is non-trivial, and requires that we sequentialize a tile-size factor corresponding to the tile of the second parallel dimension, but can be achieved in three logical steps:

The first step is two-dimensional tiling as before, but only the B accesses are optimized/mapped to shared memory (the tile sizes on each dimension do not have to be equal, e.g., T1 and T2).

The second step is to strip mine the innermost parallel dimension by T2, i.e., the map on B is rewritten as a nested streamPar-map, and the internal streamSeq SOACs are re-written to loops.

The third step is to move the stripmined-map to the innermost position, for example by applying kernel extraction (use map distribution and loop-map interchange), and finally to sequentialize it. Note that each thread now produces T2 elements -- these should be kept in private memory (register tiling), i.e., when it exhibits regular access, private memory is efficiently cached. Since the accesses to the A array are invariant to the strip-mined map on B, then one element of A, hold in a register will be used inside the stripmined map of B, hence amortizing global memory accesses for A.

diku-dk / futhark