diku-dk / futhark

:boom::computer::boom: A data-parallel functional programming language
http://futhark-lang.org
ISC License

Register + Block Tiling #615

Closed coancea closed 3 years ago

coancea commented 6 years ago

The motivation is of course matrix multiplication, but we will complicate things just a little bit and use DGEMM as the running example. We assume matrices

    A of shape m × n,   B of shape n × q,   C of shape m × q

and would like to compute

    C' = alpha * A * B + beta * C

We will demonstrate the transformation using C-like pseudocode and then discuss how it translates to OpenCL code. We assume the following source code:

forall ( i = 0; i < m; i++ ) {
   forall (j = 0; j < q; j++) {
      float c = 0.0;
      for (k = 0; k < n; k++) {
         c += A[i,k] * B[k,j];
      }
      C'[i,j] = alpha * c + beta * C[i,j];
   }
}

We choose two tile sizes, T and H.

Since the i and j dimensions are parallel, we are free to interchange them and move them inwards as we desire.

The resulting code is:

forall ( ii = 0; ii < m; ii += T ) {
   forall ( jjj = 0; jjj < q; jjj += T*H ) {
      float c[H][T][T];

      forall ( jj = jjj; jj < min(jjj+T*H, q); jj += T ) {
         forall ( j = jj; j < min(jj+T, q); j += 1 ) {
            for ( i = ii; i < min(ii+T, m); i += 1 ) {
               c[(jj-jjj)/T][j-jj][i-ii] = 0.0;
      }  }  }

      for ( kk = 0; kk < n; kk += H ) {
         // here we will insert a collective copy in OpenCL code
         // of the slice: A[ii : ii+T, kk : kk + H]
         for (k = kk; k < min(kk+H, n); k += 1 ) {
            forall ( jj = jjj; jj < min(jjj+T*H, q); jj += T ) {
               forall ( j = jj; j < min(jj+T, q); j += 1 ) {
                  float b = B[k, j];
                  for ( i = ii; i < min(ii+T, m); i += 1 ) {
                     c[(jj-jjj)/T][j-jj][i-ii] += A[i,k] * b;
      }  }  }  }  }

      forall ( jj = jjj; jj < min(jjj+T*H, q); jj += T ) {
         forall ( j = jj; j < min(jj+T, q); j += 1 ) {
            for ( i = ii; i < min(ii+T, m); i += 1 ) {
               C'[i,j] = alpha * c[(jj-jjj)/T][j-jj][i-ii] + beta * C[i,j];
      }  }  }
}  }

We will now discuss the translation to OpenCL. We use a two-dimensional grid in which each workgroup has T × H threads: `thd_x = get_local_id(0)` ranges over [0, T) and `thd_y = get_local_id(1)` over [0, H), together selecting one column j of the tile; `get_group_id(0)` selects a T*H-wide tile of columns and `get_group_id(1)` a T-high tile of rows. Each thread computes (up to) T elements of one column of C' in registers.

The OpenCL code is:

    int i, kk, k;
    float c[T];
    __local float A_sh[T][H];

    int thd_x = get_local_id(0);
    int thd_y = get_local_id(1);
    int jjj   = get_group_id(0) * H * T; 
    int jj    = jjj + thd_y * T;
    int j     = jj  + thd_x;
    int ii    = get_group_id(1) * T;

    for (i = ii; i < min(ii+T, m); i++) {
        c[i - ii] = 0;
    }

    for (kk = 0; kk < n; kk += H) {
        A_sh[thd_x][thd_y] = (ii+thd_x<m) && (kk+thd_y<n) ?
            A[(ii+thd_x)*n + kk + thd_y] : 0.0;
        barrier(CLK_LOCAL_MEM_FENCE);
        for (k = kk; k < min(kk+H, n); k++) {
            float b = (j < q) ? B[k*q + j] : 0.0;
            for (i = ii; i < min(ii+T, m); i++) {
                c[i-ii] += A_sh[i-ii][k-kk] * b;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    for (i = ii; i < min(ii+T, m); i++) {
        if (j < q) C'[i*q + j] = alpha * c[i-ii] + beta * C[i*q + j];
    }

We make the following important observations:

Given the above observations, it seems possible to perform judicious code generation that always makes coalesced accesses to global memory, regardless of the layout of the A, B and C arrays, while minimizing the number of necessary transpositions. For example, several cases require no extra transposition at all.

Implementing this transformation in Futhark seems to require extending the compiler IR: in the example above, each thread actually computes a chunk of a column of the return array. The maximal chunk size is T, but not all threads will compute a full chunk, because some may fall outside the bounds of the destination array.

With respect to the actual implementation, a good starting point would be to rely on/extend incremental flattening to produce the case in which the body of the kernel is exactly a sequential stream whose parameters are invariant to the two innermost dimensions of the grid, i.e., at least one stream input is invariant to the innermost parallel dimension and at least one other is invariant to the outer dimension. When the compiler finds this pattern, it looks possible (and safe) to pattern-match the proposed (target) code. (Ideally, later on, we could write a transformation that actually moves the outermost tile into the innermost position, assuming there are other sequential recurrences inside the stream.)

athas commented 3 years ago

We did this.