cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0
326 stars 92 forks source link

Issue with systolic array based large matrix multiplication #267

Open jessewjx opened 4 years ago

jessewjx commented 4 years ago

Hi !   I am deploying the heteroCL on the Vitis platform and running the systolic array based matrix multiplication sample systolic_array_vitis.py. Here are three issues I came across, wondering whether we can possibly ask for some suggestions?

  1. In the original kernel implementation in the systolic_array_vitis.py , the dot product result in the systolic array implementation seems to be inconsistent with the classical matrix multiplication. The generated Vivado HLS C++ kernel code is shown below, where the localB matrix has the danger of index out of range.
    bit32 localB[15][16];
    for (bit32 args = 0; args < 15; ++args) {
      for (bit32 args0 = 0; args0 < 16; ++args0) {
        localB[args][args0] = 0;
      }
    }
    for (bit32 args1 = 0; args1 < 16; ++args1) {
      for (bit32 args01 = 0; args01 < 16; ++args01) {
        output[args1][args01] = 0;
      }
    }
    bit32 localA[16][15];
    for (bit32 args2 = 0; args2 < 16; ++args2) {
      for (bit32 args02 = 0; args02 < 15; ++args02) {
        localA[args2][args02] = 0;
      }
    }
    bit32 update;
    for (bit32 k = 0; k < 16; ++k) {
    #pragma HLS pipeline
      for (bit32 y = 0; y < 16; ++y) {
        for (bit32 x = 0; x < 16; ++x) {
          bit32 temp_localA;
          temp_localA = ((0 < x) ? ((bit32)localA[y][(x + -1)]) : ((bit32)A[y][k]));
          bit32 temp_localB;
          temp_localB = ((0 < y) ? ((bit32)localB[(y + -1)][x]) : ((bit32)B[k][x]));
          output[y][x] = ((bit32)(((ap_int<65>)((k == 0) ? ((bit32)0) : ((bit32)output[y][x]))) + ((ap_int<65>)(((ap_int<64>)temp_localA) * ((ap_int<64>)temp_localB)))));
          localA[((x / 15) + y)][(x % 15)] = temp_localA;
          localB[y][x] = temp_localB;
        }
      }
    }

    After changing the dimension of the local matrix to be consistent with the input matrix, the inconsistency is solved.

    def kernel(A, B):
           localA = hcl.compute((m, k), lambda *args: 0, "localA")
           localB = hcl.compute((k, n), lambda *args: 0, "localB")
           output = hcl.compute((m, n), lambda *args: 0, "output")
           def update(k, y, x):
     
               localA[y, x] = hcl.select(x>0, localA[y, x-1], A[y, k])
               localB[y, x] = hcl.select(y>0, localB[y-1, x], B[k, x])
               output[y, x] = hcl.select(k==0, 0, output[y, x]) + localA[y, x] * localB[y, x]
     
           hcl.mutate((m, dim_y, dim_x),
               lambda k, y, x: update(k, y, x), name="update")
           return output
  2. When trying to run with 1024x1024 matrix multiplication, the output is an zero matrix,
  1. Referring to #228 , after setting compile ="vivado_hls" , the synthesis process can be finished otherwise vitis platform info missing is reported.
hecmay commented 4 years ago

Hi jessewjx,

  1. The version you are using does not really work on large matrix. For 1024x1024 GEMM, It tries to implement a systolic array with 1024x1024 PEs, which is impossible to be implemented on a single FPGA.

    We have a new version that tiles the matrix into smaller pieces and performs computation one by one -- e.g. it tiles a 1024x1024 matrix into a group of 32x32 matrices, passes these sub-matrices to a 32x32 systolic array, and finally sums the results up. I am still testing it and will upload the new version within a few days. Will keep you posted.

  2. Please make sure v++ is on the PATH by running which v++. You should set up the XRT and Vitis environment before using HeteroCL's vitis backend.

jessewjx commented 4 years ago

Hi Hecmay,

Thank you for the reply and looking forward to the the systolic array based tiled matrix multiplication implementation! Also, we are wondering whether it is possible to achieve 4096x4096 matrix vector multiplication with the implementation you mentioned? thanks!

hecmay commented 4 years ago

Yeah. You can parameterize the algorithm as matrix A (4096x4096) multiplied with B (4096x1).

jessewjx commented 4 years ago

Great! that is very inspiring and looking forward to the release of the new version implementation of systolic array based matrix multiplication!

hecmay commented 4 years ago

Hi @jessewjx. Please see the systolic gemm here: https://github.com/Hecmay/heterocl/blob/fix/samples/systolic_array/systolic_array_module_stream.py

It may take a long time for this PR to be merged. You can simply pull back and compile the fix branch to give it a try: https://github.com/Hecmay/heterocl/blob/fix/

jessewjx commented 4 years ago

Thank you @Hecmay and we will have a try with the systolic gemm!

jessewjx commented 4 years ago

Hi @Hecmay, I tried to run the systolic gemm on the Vitis platform. Probably due to some setup issue, “vitis platform information missing ” occurred. Please note that the vitis environment is already set up on the server.

I then generated Vivado_HLS C++ code by setting the target to “vivado_hls” and ran the C simulation. I tested the systolic array implementation with the matrix vector multiplication with matrix size=32x32 and systolic size=4 .

The resulting vector does not agree with the simple matrix vector product. My hunch is that this could have happened because I was not executing the program correctly or that there is a minor bug in the current implementation.

Could you advice on this issue? It would be great if you could give some rough guides on how to execute the program as well. Thanks for writing this nice piece of software, hoping to hear from you soon.