The motivation is of course matrix multiplication, but we will complicate things just a little bit and use GEMM as the running example. We assume matrices:
A of size m x n
B of size n x q
C' and C of size m x q
and would like to compute
C' := alpha * A * B + beta * C
We will demonstrate the transformation using a C-like pseudocode and then we will discuss how that translates to OpenCL code. We assume the following source code:
forall (i = 0; i < m; i++) {
  forall (j = 0; j < q; j++) {
    float c = 0.0;
    for (k = 0; k < n; k++) {
      c += A[i,k] * B[k,j];
    }
    C'[i,j] = alpha * c + beta * C[i,j];
  }
}
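Before tiling, it may help to pin down the intended semantics. The following is a plain-C, sequential rendering of the pseudocode above; row-major storage is assumed, and `gemm_ref` and its argument order are our own names, not part of any BLAS interface:

```c
/* Sequential C rendering of the pseudocode above, assuming row-major
   storage. The forall loops are parallel in the pseudocode but run as
   ordinary loops here; this function only pins down the semantics. */
void gemm_ref(int m, int n, int q, float alpha, float beta,
              const float *A,  /* m x n */
              const float *B,  /* n x q */
              const float *C,  /* m x q */
              float *Cout)     /* m x q, the array C' */
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < q; j++) {
            float c = 0.0f;
            for (int k = 0; k < n; k++)
                c += A[i * n + k] * B[k * q + j];
            Cout[i * q + j] = alpha * c + beta * C[i * q + j];
        }
    }
}
```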
We choose two tile sizes T and H:
we tile the first parallel dimension, of index i, by T;
we double-tile the second parallel dimension, of index j, first by T * H and then by T;
we tile the sequential dimension, of index k, by H.
Since the dimensions of indices i and j are parallel, we can move/interchange them inwards as we desire:
we move the i tile innermost and sequentialize it;
we move the double tiles of the dimension of index j inside the k tile; these will correspond to OpenCL workgroup parallelism.
The resulting code is:
forall (ii = 0; ii < m; ii += T) {
  forall (jjj = 0; jjj < q; jjj += T*H) {
    float c[H][T][T];
    forall (jj = jjj; jj < min(jjj+T*H, q); jj += T) {
      forall (j = jj; j < min(jj+T, q); j += 1) {
        for (i = ii; i < min(ii+T, m); i += 1) {
          c[(jj-jjj)/T][j-jj][i-ii] = 0.0;
    } } }
    for (kk = 0; kk < n; kk += H) {
      // here we will insert a collective copy in OpenCL code
      // of the slice: A[ii : ii+T, kk : kk+H]
      for (k = kk; k < min(kk+H, n); k += 1) {
        forall (jj = jjj; jj < min(jjj+T*H, q); jj += T) {
          forall (j = jj; j < min(jj+T, q); j += 1) {
            float b = B[k, j];
            for (i = ii; i < min(ii+T, m); i += 1) {
              c[(jj-jjj)/T][j-jj][i-ii] += A[i,k] * b;
    } } } } }
    forall (jj = jjj; jj < min(jjj+T*H, q); jj += T) {
      forall (j = jj; j < min(jj+T, q); j += 1) {
        for (i = ii; i < min(ii+T, m); i += 1) {
          C'[i,j] = alpha * c[(jj-jjj)/T][j-jj][i-ii] + beta * C[i,j];
    } } }
  }
}
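As a sanity check on the transformation, here is a sequential C rendering of the tiled loop nest, runnable against the naive version. The forall loops run as ordinary loops, and TS and HS stand for the tile sizes T and H, fixed here to small illustrative values:

```c
#define TS 2   /* tile size T (small illustrative value) */
#define HS 3   /* tile size H (small illustrative value) */

static int imin(int a, int b) { return a < b ? a : b; }

/* Sequential C rendering of the tiled pseudocode above, row-major.
   The accumulator c has shape [H][T][T], matching the indexing
   c[(jj-jjj)/T][j-jj][i-ii]. */
void gemm_tiled(int m, int n, int q, float alpha, float beta,
                const float *A, const float *B, const float *C,
                float *Cout)
{
    for (int ii = 0; ii < m; ii += TS)
    for (int jjj = 0; jjj < q; jjj += TS * HS) {
        float c[HS][TS][TS];
        /* initialize the per-tile accumulators */
        for (int jj = jjj; jj < imin(jjj + TS*HS, q); jj += TS)
        for (int j = jj; j < imin(jj + TS, q); j++)
        for (int i = ii; i < imin(ii + TS, m); i++)
            c[(jj - jjj)/TS][j - jj][i - ii] = 0.0f;
        /* accumulate over the tiled sequential dimension */
        for (int kk = 0; kk < n; kk += HS)
        for (int k = kk; k < imin(kk + HS, n); k++)
        for (int jj = jjj; jj < imin(jjj + TS*HS, q); jj += TS)
        for (int j = jj; j < imin(jj + TS, q); j++) {
            float b = B[k * q + j];
            for (int i = ii; i < imin(ii + TS, m); i++)
                c[(jj - jjj)/TS][j - jj][i - ii] += A[i * n + k] * b;
        }
        /* epilogue: scale and write back */
        for (int jj = jjj; jj < imin(jjj + TS*HS, q); jj += TS)
        for (int j = jj; j < imin(jj + TS, q); j++)
        for (int i = ii; i < imin(ii + TS, m); i++)
            Cout[i * q + j] = alpha * c[(jj - jjj)/TS][j - jj][i - ii]
                            + beta * C[i * q + j];
    }
}
```

The test below uses sizes that are not multiples of the tile sizes, so the min-bounds (partial tiles) are exercised; the values are small integers, so float arithmetic is exact and the results must match the naive triple loop bit for bit.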
We will now discuss the translation to OpenCL. We use:
a two-dimensional workgroup of size {T, H} in OpenCL notation; please note that this means that dimension T is the innermost one and dimension H the outer one. The workgroup dimensions correspond to the parallel loops of indices j and jj.
a two-dimensional grid of (global) sizes {T * ceil(q / (T*H)), H * ceil(m / T)}. As before, the first dimension is actually the innermost one and the other the outermost one. The two dimensions correspond to the two levels of outer parallelism of indices jjj and ii.
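For concreteness, the host-side size computation might look as follows (a sketch; `launch_sizes` and `ceil_div` are our own names, and OpenCL's convention that each global size is a multiple of the corresponding local size is assumed):

```c
#include <stddef.h>

static size_t ceil_div(size_t a, size_t b) { return (a + b - 1) / b; }

/* Compute the OpenCL local and global work sizes for the tiled GEMM
   kernel, given the matrix dimensions m, q and the tile sizes T, H. */
void launch_sizes(size_t m, size_t q, size_t T, size_t H,
                  size_t local[2], size_t global[2])
{
    local[0] = T;                        /* innermost: index j         */
    local[1] = H;                        /* outer: index jj            */
    global[0] = T * ceil_div(q, T * H);  /* ceil(q/(T*H)) groups, jjj  */
    global[1] = H * ceil_div(m, T);      /* ceil(m/T) groups, ii       */
}
```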
The OpenCL code is:
int i, kk, k;
float c[T];
__local float A_sh[T][H];
int thd_x = get_local_id(0);
int thd_y = get_local_id(1);
int jjj = get_group_id(0) * H * T;
int jj  = jjj + thd_y * T;
int j   = jj + thd_x;
int ii  = get_group_id(1) * T;

for (i = ii; i < min(ii+T, m); i++) {
  c[i-ii] = 0.0;
}
for (kk = 0; kk < n; kk += H) {
  A_sh[thd_x][thd_y] = (ii+thd_x < m) && (kk+thd_y < n) ?
                       A[(ii+thd_x)*n + kk+thd_y] : 0.0;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (k = kk; k < min(kk+H, n); k++) {
    float b = (j < q) ? B[k*q + j] : 0.0;
    for (i = ii; i < min(ii+T, m); i++) {
      c[i-ii] += A_sh[i-ii][k-kk] * b;
    }
  }
  barrier(CLK_LOCAL_MEM_FENCE);
}
for (i = ii; i < min(ii+T, m); i++) {
  if (j < q) C'[i*q + j] = alpha * c[i-ii] + beta * C[i*q + j];
}
We make the following important observations:
The collective read from array A, which is stored in global memory, is not coalesced, but the reads from arrays B and C and the write to C' are coalesced.
If we instead double-tile the first parallel dimension i and single-tile j, we get a coalesced read for A at the expense of un-coalesced reads/writes of C and C'. The latter seems more difficult to fix than the former (i.e., the un-coalesced collective read from A).
In the shown version we can easily fix the non-coalesced access to A by choosing tiles of the same size, i.e., T = H. Then A_sh : [T][T]f32, and the collective read can simply be written as:

  A_sh[thd_y][thd_x] = (ii+thd_y < m) && (kk+thd_x < n) ?
                       A[(ii+thd_y)*n + kk+thd_x] : 0.0;

and the collective read from A is now coalesced (consecutive values of thd_x access consecutive global addresses), while the compute loop keeps indexing A_sh[i-ii][k-kk] unchanged.
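As a sanity check, here is a small host-side comparison (our own testing code, not part of the kernel) of the original collective load and the transposed, coalesced one with T == H, run over a tile that crosses the matrix boundary; both must fill shared memory identically:

```c
#define TS 4   /* with the indexing trick we need T == H */

/* Original collective load: thread (tx, ty) writes A_sh[tx][ty];
   consecutive tx read global addresses that are n apart (not
   coalesced). */
void fill_original(const float *A, int n, int m, int ii, int kk,
                   float sh[TS][TS])
{
    for (int ty = 0; ty < TS; ty++)
        for (int tx = 0; tx < TS; tx++)
            sh[tx][ty] = (ii + tx < m && kk + ty < n)
                       ? A[(ii + tx) * n + kk + ty] : 0.0f;
}

/* Transposed load: thread (tx, ty) writes A_sh[ty][tx]; consecutive
   tx now read consecutive global addresses (coalesced), and the
   resulting A_sh contents are identical. */
void fill_trick(const float *A, int n, int m, int ii, int kk,
                float sh[TS][TS])
{
    for (int ty = 0; ty < TS; ty++)
        for (int tx = 0; tx < TS; tx++)
            sh[ty][tx] = (ii + ty < m && kk + tx < n)
                       ? A[(ii + ty) * n + kk + tx] : 0.0f;
}
```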
Finally, notice that if A is already stored in transposed form, then the read from A in the code above is already coalesced.
Given the above observations, it seems possible to perform judicious code generation that always performs coalesced accesses to global memory, regardless of the layout of the A, B and C arrays, while minimizing the number of necessary transpositions. For example, several cases do not require any extra transposition:
if C is not in transposed form and C' is not immediately transposed after the kernel, then we always double-tile the innermost parallel dimension, of original index j: if A is in transposed form, then we keep the above code and can choose T != H; else, if A is not in transposed form, then we choose T == H and use the indexing trick above.
else, if C and C' are in transposed form, then it is perhaps better to double-tile the outermost parallel dimension, of original index i ...
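The case analysis above could be sketched as a small decision function. All names here are hypothetical, and the transposed-C branch is filled in only by symmetry, which the text deliberately leaves open:

```c
/* Hypothetical sketch of the layout-driven tiling choice. */
typedef enum { DOUBLE_TILE_J, DOUBLE_TILE_I } double_tiled_dim;

typedef struct {
    double_tiled_dim dim;  /* which parallel dimension is double-tiled  */
    int need_t_eq_h;       /* 1 when T == H is required (indexing trick) */
} tiling_choice;

tiling_choice choose_tiling(int a_transposed, int c_transposed)
{
    tiling_choice tc;
    if (!c_transposed) {
        /* keep accesses to C and C' coalesced; fix A if needed */
        tc.dim = DOUBLE_TILE_J;
        tc.need_t_eq_h = !a_transposed;
    } else {
        /* mirror case, sketched only (by symmetry) */
        tc.dim = DOUBLE_TILE_I;
        tc.need_t_eq_h = a_transposed;
    }
    return tc;
}
```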
Implementing this transformation in Futhark seems to require extending the compiler IR: in the example above, each thread actually computes a chunk of a column of the result array. The maximal chunk size is T, but not all threads will compute a full chunk, because they may run out of the bounds of the destination array.
With respect to the actual implementation, a good starting point would be to rely on/extend incremental flattening to produce the case in which the body of the kernel is exactly a sequential stream whose parameters are invariant to the two innermost dimensions of the grid, i.e., at least one stream input is invariant to the innermost parallel dimension and at least one other is invariant to the outer dimension. When the compiler finds this pattern, it looks possible (and safe) to pattern-match the proposed (target) code. (Ideally, later on, we could write a transformation that actually moves the outermost tile into the innermost position -- assuming that there are other sequential recurrences inside the stream ...)