Block Tiling Using Shared Memory
We will use three simplified examples from the n-body, lavaMD and Matrix-Matrix Multiplication benchmarks. The notation of streamSeq might differ from what is currently in Futhark -- I keep around the chunk size and the current index k of streamSeq, i.e., the next chunk starts at array element k and ends at array element k+chunk-1.
I assume the analysis is run on extracted kernels, i.e., on code that can be seen as a perfect map nest that is going to be executed in parallel on the GPU. I also assume that map, redomap, scan, filter, streamMap/Red, etc., appearing inside the kernel have been turned into streamSeq constructs.
1. How to Recognize the Beneficial Cases for Block Tiling?
In essence block-tiling is beneficial whenever we can detect a streamSeq whose input array is invariant to one of the parallel dimensions of the kernel.
(A) n-body
fun [f32, n] main(f32 eps, [f32, n] bodies) =
  map( fn f32 (f32 body) =>                       -- 1-dim Kernel
         streamSeq( fn f32 (int kk, [f32, chunk] bodies_chunk, f32 acc) =>
                      loop (acc) = for k < chunk do
                          acc + (bodies_chunk[k] - body)
                      in  acc
                  , bodies, 0f32)
     , bodies)
n-body is the simplest case demonstrating such an access: I assume the kernel is one-dimensional and contains only the outermost map, which is fed array bodies. However, array bodies is also fed to the streamSeq that forms the kernel body. Assuming the one-dimensional block size is TILE (e.g., 128, 256, etc.), one can choose chunk = TILE, and just before the loop make the block's threads cooperatively write the current TILE-chunk of bodies_chunk to fast (local/shared) memory, such that inside the loop all accesses to bodies_chunk are served from fast memory (rather than global memory, even if, in this case, global memory might be cached efficiently).
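To make the cooperative copy concrete, here is a minimal hand-written CUDA sketch of what such a tiled n-body kernel could look like. The kernel name nbody_tiled, the output array out, and TILE = 256 are illustrative assumptions; the simplified body interaction is carried over from the example above, and this is a sketch rather than compiler output.

#define TILE 256

// Hypothetical CUDA counterpart of the tiled n-body kernel: each block
// cooperatively stages a TILE-sized chunk of `bodies` in shared memory;
// every thread then reads the whole chunk from fast memory.
__global__ void nbody_tiled(const float *bodies, float *out, int n) {
    __shared__ float sh_bodies[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float body = (i < n) ? bodies[i] : 0.0f;
    float acc  = 0.0f;

    for (int kk = 0; kk < n; kk += TILE) {           // one iteration per chunk
        int j = kk + threadIdx.x;
        if (j < n)                                   // bounds check on the copy
            sh_bodies[threadIdx.x] = bodies[j];
        __syncthreads();                             // chunk is now in fast memory

        int lim = min(TILE, n - kk);
        for (int k = 0; k < lim; k++)                // all reads hit shared memory
            acc += sh_bodies[k] - body;
        __syncthreads();                             // before overwriting the tile
    }
    if (i < n) out[i] = acc;
}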
(B) lavaMD
fun [[f32, m], n] main([[f32, m], n] AA, [[int, p], n] inds_nhbs) =
  map( fn [f32,m] ([f32,m] A, int i) =>             -- Kernel's first parallel dimension
         map( fn f32 (f32 a) =>                     -- Kernel's second parallel dimension
                streamSeq( fn f32 (int kk1, [int, chunk1] inds_chunk, f32 acc1) =>
                             loop (acc1) = for k1 < chunk1 do
                                 let ind         = inds_chunk[k1] in
                                 let invar_AArow = AA[ind]        in
                                 streamSeq( fn f32 (int kk2, [f32, chunk2] invar_AArow_chunk, f32 acc2) =>
                                              loop (acc2) = for k2 < chunk2 do
                                                  acc2 + invar_AArow_chunk[k2] * 2.0f32
                                              in  acc2
                                          , invar_AArow, acc1)
                             in  acc1
                         , inds_nhbs[i], 0f32)
            , A)
     , zip(AA, iota(n)))
lavaMD presents a slightly more complex case that basically demonstrates that we cannot rely on simple pattern-matching, but rather need to do a full pass of invariant-array detection. In the example above, invar_AArow is invariant to the second (map) parallel dimension of the kernel. Note that this array is created somewhere deep inside the kernel, but it is fed as input to the second streamSeq and as such can be block tiled, i.e., used directly from shared memory.
Note that this requires a 2-D grid, but only a 1-dim block (i.e., the second map is covered by one or more blocks), and the TILE size can be the size of the 1-dim block. This allows amortizing one global-memory access over TILE accesses from fast memory (to invar_AArow_chunk).
Note however that if the array had instead been invariant to the first map, block tiling would have required interchanging the parallel dimensions, i.e., bringing that map innermost, because fast memory is shared only within the same workgroup.
(C) MM-Mult
fun [[f32,q],m] main([[f32,n],m] A, [[f32,q],n] B0) =
  let B = transpose(B0) in
  map(fn [f32,q] ( [f32,n] a ) =>                   -- Kernel's first parallel dimension
        map(fn f32 ( [f32,n] b ) =>                 -- Kernel's second parallel dimension
              streamSeq(fn f32 (int kk, [f32,chunk] a_c, [f32,chunk] b_c, f32 acc) =>
                          loop (acc) = for k < chunk do
                              acc + a_c[k]*b_c[k]
                          in  acc
                       , a, b, 0f32 )
           , B)
     , A )
The third example is the trickiest one, matrix-matrix multiplication: here a is invariant to the second map, while b is invariant to the first map, and they are both fed to the streamSeq that forms the kernel body. In order to cache both accesses in shared memory, one needs to create a 2-dimensional workgroup (so that accesses to both a and b can be shared).
2. How to Implement Tiling?
Once tiling opportunities have been detected, block tiling can be implemented by (i) adding a notion of fast (shared/local) memory to the Kernel, by (ii) inserting a special function call that performs the per-thread copying to fast memory just between the invariant streamSeq and its corresponding loop, and, finally, by (iii) replacing accesses to the invariant array with accesses from fast memory.
streamSeq provides a convenient notation for doing this because the indexing into array chunks starts at 0 and goes up to chunk-1, which is exactly how we need to access fast memory! We use a CUDA notation here -- we will have to extend the kernel with some magic things such as a notion of block/workgroup size (TILE) and of indexing within a workgroup (threadIdx.x), and obviously, a notion of fast memory parameterized over the workgroup size, together with a way to copy to fast memory, denoted below with copyFastMem.
(A) n-body
For n-body we assume a 1-dim block of size TILE, and as such i corresponds to blockIdx.x*blockDim.x + threadIdx.x. We choose the chunk of the invariant streamSeq to be TILE. The translated code becomes:
fun [f32, n] main(f32 eps, [f32, n] bodies) =
  map( fn f32 (int i, f32 body) =>                  -- 1-dim Kernel;
         -- 1-dim workgroup/block of size TILE, threadIdx.x in [0,TILE-1]
         -- fast memory: shMem_bodies[TILE]
         streamSeq( fn f32 (int kk, [f32, TILE] bodies_chunk, f32 acc) =>
                      let shMem_bodies = copyFastMem(bodies_chunk[threadIdx.x], threadIdx.x) in
                      loop (acc) = for k < TILE do
                          acc + (shMem_bodies[k] - body)
                      in  acc
                  , bodies, 0f32)
     , zip(iota(n), bodies))
Note that with this bit of magic, when we finally translate a streamSeq to a loop we must offset the indexing into arrays such as bodies_chunk (by the chunk's start index), but not into fast-memory arrays such as shMem_bodies. Note also that copyFastMem should check that the index into the global-memory array is in range, and if it is not, it should not write to shared memory.
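For intuition, copyFastMem could lower to something like the device-side helper below for a 1-dim block; the name, signature and parameters here are hypothetical, the point being only the bounds check and the barrier.

// Hypothetical lowering of copyFastMem: each thread writes at most one element
// of the chunk into shared memory, guarded by a bounds check on the global
// index, and a barrier makes the tile visible to all threads of the workgroup
// before the sequential loop starts reading it.
__device__ void copyFastMem(float *sh_dst, const float *glob_src,
                            int loc_idx, int glob_idx, int glob_len) {
    if (glob_idx < glob_len)
        sh_dst[loc_idx] = glob_src[glob_idx];
    __syncthreads();
}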
(B) lavaMD
We treat lavaMD similarly, but note that while the grid semantically has a 2-D structure, we only need a 1-dim structure for the block/workgroup (of size TILE), such that one or more blocks cover the inner (second) map. This is potentially subject to multi-versioned code, since it only makes sense if the second map is large enough! We assume for simplicity that TILE divides m, the size of the second map.
fun [[f32, m], n] main([[f32, m], n] AA, [[int, p], n] inds_nhbs) =
  map( fn [f32,m] ([f32,m] A, int i) =>             -- Kernel's first parallel dimension
         map( fn f32 (f32 a) =>                     -- Kernel's second parallel dimension
                -- 1-dim Block of size TILE, threadIdx.x in [0,TILE-1]
                -- fast memory: shMem_AArow[TILE]
                streamSeq( fn f32 (int kk1, [int, chunk1] inds_chunk, f32 acc1) =>
                             loop (acc1) = for k1 < chunk1 do
                                 let ind         = inds_chunk[k1] in
                                 let invar_AArow = AA[ind]        in
                                 streamSeq( fn f32 (int kk2, [f32, TILE] invar_AArow_chunk, f32 acc2) =>
                                              let shMem_AArow = copyFastMem(
                                                      invar_AArow_chunk[threadIdx.x],
                                                      threadIdx.x ) in
                                              loop (acc2) = for k2 < TILE do
                                                  acc2 + shMem_AArow[k2] * 2.0f32
                                              in  acc2
                                          , invar_AArow, acc1)
                             in  acc1
                         , inds_nhbs[i], 0f32)
            , A)
     , zip(AA, iota(n)))
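For concreteness, a hand-written CUDA sketch of this tiled lavaMD-like kernel might look roughly as follows. The name lavamd_tiled, the flat row-major layouts, the output array out, and the 2-D grid / 1-D block mapping are illustrative assumptions, with the dummy `* 2.0f` body kept from the example; TILE is assumed to divide m, as stated above.

#define TILE 128

// Hypothetical CUDA rendering of the tiled lavaMD example: blockIdx.y picks
// the row i (first parallel dimension); blockIdx.x/threadIdx.x cover the
// second parallel dimension; the row AA[ind] -- invariant to that dimension --
// is staged in shared memory one TILE-chunk at a time.
__global__ void lavamd_tiled(const float *AA, const int *inds_nhbs,
                             float *out, int n, int m, int p) {
    __shared__ float sh_AArow[TILE];
    int i = blockIdx.y;                                // first parallel dimension
    int j = blockIdx.x * blockDim.x + threadIdx.x;     // second parallel dimension
    float acc = 0.0f;

    for (int k1 = 0; k1 < p; k1++) {                   // outer sequential loop
        int ind = inds_nhbs[i * p + k1];
        const float *AArow = &AA[ind * m];             // invariant to j
        for (int kk = 0; kk < m; kk += TILE) {         // one iteration per chunk
            sh_AArow[threadIdx.x] = AArow[kk + threadIdx.x];  // TILE divides m
            __syncthreads();
            for (int k2 = 0; k2 < TILE; k2++)
                acc += sh_AArow[k2] * 2.0f;
            __syncthreads();
        }
    }
    if (j < m) out[i * m + j] = acc;
}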
(C) MM-Mult
This is the tricky case because it requires 2-dim tiling, i.e., we need to create a 2-dim block/workgroup such that the accesses to both arrays can be tiled in fast memory. To understand how the indexing is transformed, it is instructive to review the transformation first: semantically, the two maps that give the parallel dimensions of the kernel are strip-mined with a TILE factor and the loops are then interchanged so that the two tile loops come outermost, yielding the doall structure below, which corresponds to the original map-based program:
DOALL ii = 0, m-1, TILE                        -- (ii/TILE) gives the workgroup number on the Y dimension
  DOALL jj = 0, q-1, TILE                      -- (jj/TILE) gives the workgroup number on the X dimension
    DOALL i = ii, min(ii+TILE-1, m-1), 1       -- within workgroup: (i - ii) => threadIdx.y
      DOALL j = jj, min(jj+TILE-1, q-1), 1     -- within workgroup: (j - jj) => threadIdx.x
        -- kernel code
        acc = 0.0;
        DO kk = 0, n-1, TILE
          shMemA[i-ii, j-jj] = A[i, j - jj + kk];
          shMemB[j-jj, i-ii] = B[j, i - ii + kk];
          __sync_barrier_sh_mem();
          DO k = kk, min(kk+TILE-1, n-1), 1
            acc += shMemA[i-ii, k - kk] * shMemB[j-jj, k - kk]
          ENDDO
          __sync_barrier_sh_mem();             -- before the next chunk overwrites the tiles
        ENDDO
        C[i,j] = acc
ENDDOALL ENDDOALL ENDDOALL ENDDOALL
Rewritten as a Futhark kernel, this corresponds to:
fun [[f32,q],m] main([[f32,n],m] A, [[f32,q],n] B0) =
  let B = transpose(B0) in
  map(fn [f32,q] (int i, [f32,n] a ) =>             -- Kernel's first parallel dimension
        map(fn f32 ( int j, [f32,n] b ) =>          -- Kernel's second parallel dimension
              -- 2-dim Block of size TILE x TILE,
              -- threadIdx.x, threadIdx.y in [0,TILE-1]
              -- fast memory: shMemA[TILE,TILE]
              -- fast memory: shMemB[TILE,TILE+1]
              streamSeq(fn f32 (int kk, [f32,TILE] a_c, [f32,TILE] b_c, f32 acc) =>
                          let shMemA = copyFastMem( a_c[threadIdx.x]
                                                  , threadIdx.y, threadIdx.x ) in
                          let shMemB = copyFastMem( b_c[threadIdx.y]
                                                  , threadIdx.x, threadIdx.y ) in
                          loop (acc) = for k < TILE do
                              acc + shMemA[threadIdx.y, k] * shMemB[threadIdx.x, k]
                          in  acc
                       , a, b, 0f32 )
           , zip(iota(q), B) )
     , zip(iota(m), A) )
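Putting the imperative and the Futhark views together, a hypothetical hand-written CUDA kernel corresponding to the code above could look as follows. The kernel name, the flat row-major layouts, and the assumption that TILE divides m, q and n are mine; note that the B load is still uncoalesced here, which is the subject of the first observation below.

#define TILE 16

// Hypothetical CUDA counterpart of the DOALL nest above. B is assumed already
// transposed, i.e., B[j*n + k] == B0[k*q + j]; TILE is assumed to divide
// m, q and n for brevity.
__global__ void mmult_tiled(const float *A, const float *B, float *C,
                            int m, int n, int q) {
    __shared__ float shMemA[TILE][TILE];
    __shared__ float shMemB[TILE][TILE];
    int i = blockIdx.y * TILE + threadIdx.y;     // row of A / row of C
    int j = blockIdx.x * TILE + threadIdx.x;     // row of B / column of C
    float acc = 0.0f;

    for (int kk = 0; kk < n; kk += TILE) {
        shMemA[threadIdx.y][threadIdx.x] = A[i * n + kk + threadIdx.x];
        shMemB[threadIdx.x][threadIdx.y] = B[j * n + kk + threadIdx.y];  // uncoalesced
        __syncthreads();
        for (int k = 0; k < TILE; k++)
            acc += shMemA[threadIdx.y][k] * shMemB[threadIdx.x][k];
        __syncthreads();
    }
    C[i * q + j] = acc;
}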
Three Important Optimization Observations for Matrix-Matrix Multiplication (2-dim Tiling)
First, note that the access to the global-memory array B in the imperative code, and to array b_c in the Futhark-kernel code, when copying to fast memory, is UNCOALESCED, because its innermost index ends in threadIdx.y, whereas coalescing would require the innermost index to end in threadIdx.x! This kills most of the performance gained, i.e., it is about 3.8x slower than it should be. As such, I am counting on the transpose to be delayed and to change the indexing into the coalesced one (and if for some reason the transpose is missing, then matrix B should be transposed explicitly).
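With the transpose delayed, i.e., assuming the untransposed B0 were passed to the sketch above instead of B, the same element can be fetched with a coalesced pattern, since consecutive values of threadIdx.x then touch consecutive global addresses; the load would become something like:

        // coalesced: consecutive threadIdx.x read consecutive elements of row kk+threadIdx.y of B0
        shMemB[threadIdx.x][threadIdx.y] = B0[(kk + threadIdx.y) * q + j];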
Second, note that the inner dimension of shMemB is TILE+1 instead of TILE. This is because the (uncoalesced) access to shMemB (both read and write) will result in frequent bank conflicts if the inner size is a multiple of the number of shared-memory banks (e.g., 16), which TILE is. Making it TILE+1 alleviates that to a good extent. Note however that on my NVIDIA card, if you also make the inner dimension of shMemA TILE+1, this generates a 2x slowdown, because of plummeting occupancy (I assume).
Another workaround would be to also semantically transpose shMemB, but if too difficult, the TILE+1 inner dimension should do the trick.
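In terms of the declarations in the CUDA sketch above, the padding would simply be:

    __shared__ float shMemA[TILE][TILE];      // left unpadded: padding it hurt occupancy (see above)
    __shared__ float shMemB[TILE][TILE + 1];  // +1 column breaks the stride that causes bank conflicts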
Third, when doing block tiling one should aim to maximize the workgroup/block size, since one access to global memory is amortized over TILE accesses to fast memory. This might be less important for one-dimensional tiling, since the gain from a 256-fold to a 1024-fold amortization is probably small (diminishing returns). However, for two-dimensional tiling it seems to make a difference: the matrix-matrix multiplication with TILE=32 is about 1.2x faster than with TILE=16.
Thoughts on How to Achieve (Simplified) Parboil-like Tiling for Matrix-Matrix Multiplication
This is non-trivial, and requires that we sequentialize a tile-size factor corresponding to the tile of the second parallel dimension, but can be achieved in three logical steps:
The first step is two-dimensional tiling as before, but only the B accesses are optimized/mapped to shared memory (the tile sizes on each dimension do not have to be equal, e.g., T1 and T2).
The second step is to strip-mine the innermost parallel dimension by T2, i.e., the map on B is rewritten as a nested streamPar-map, and the internal streamSeq SOACs are rewritten to loops.
The third step is to move the strip-mined map to the innermost position, for example by applying kernel extraction (using map distribution and loop-map interchange), and finally to sequentialize it. Note that each thread now produces T2 elements -- these should be kept in private memory (register tiling); when accessed with regular indices, private memory is kept in registers or cached efficiently. Since the accesses to array A are invariant to the strip-mined map on B, an element of A held in a register will be reused inside the strip-mined map of B, hence amortizing the global-memory accesses to A. A rough CUDA-style sketch of the resulting structure is given below.
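The following is a hypothetical CUDA-style sketch of the structure this would produce, with each thread holding T2 results in registers and reusing each element of A T2 times; the kernel name, the tile sizes, the thread-to-output mapping, and the staging of the B0 tile are illustrative assumptions, not the Parboil code itself.

#define T1 16     // tile along the sequential (k) dimension
#define T2 16     // tile of the second parallel dimension, sequentialized per thread

// Hypothetical register-tiled matrix multiplication: every thread owns one row
// index i and T2 consecutive output columns jj..jj+T2-1, kept in registers.
// A T1 x T2 tile of B0 lives in shared memory; each element of A is read once
// per k iteration and reused T2 times from a register.
// Assumes a (1 x blockDim.y) thread block, a grid that covers exactly m rows,
// T1 dividing n, and T2 dividing q.
__global__ void mmult_register_tiled(const float *A, const float *B0, float *C,
                                     int m, int n, int q) {
    __shared__ float shMemB[T1][T2];
    int i  = blockIdx.y * blockDim.y + threadIdx.y;  // first parallel dimension
    int jj = blockIdx.x * T2;                        // first of this block's T2 columns

    float acc[T2];                                   // per-thread results (registers)
    for (int t = 0; t < T2; t++) acc[t] = 0.0f;

    for (int kk = 0; kk < n; kk += T1) {
        // cooperatively stage a T1 x T2 tile of B0 in shared memory
        for (int idx = threadIdx.y; idx < T1; idx += blockDim.y)
            for (int t = 0; t < T2; t++)
                shMemB[idx][t] = B0[(kk + idx) * q + (jj + t)];
        __syncthreads();

        for (int k = 0; k < T1; k++) {
            float a_reg = A[i * n + kk + k];         // one read of A ...
            for (int t = 0; t < T2; t++)
                acc[t] += a_reg * shMemB[k][t];      // ... amortized over T2 uses
        }
        __syncthreads();
    }
    for (int t = 0; t < T2; t++)
        C[i * q + (jj + t)] = acc[t];                // write this thread's T2 outputs
}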