dmsgnn / master-thesis

MLIR-based FPGA toolchain for Graph Neural Network acceleration using High-Level Synthesis. Developed as part of a Master of Science research thesis.

matmul synthesis experiments #6

Closed dmsgnn closed 1 year ago

dmsgnn commented 1 year ago

description

This issue records the list of experiments, and their results, performed on the matmul operation of the Graph Convolutional Network implemented in PyTorch. The operation in question is the following one:

https://github.com/dmsgnn/master-thesis/blob/923ff3413b394fe438c4c16f362bb61a42a1f1af/pygcn/soda_gc1/output/01searched-edited.mlir#L23

how to run

The first passes performed, starting from the initial pygcn.mlir, are the same as https://github.com/dmsgnn/gnn-acceleration-master-thesis/issues/3#issuecomment-1553001773. The whole new process is reported here for completeness.

After having run the pygcn model using the train.py file, and having correctly saved the pygcn.mlir file, the following steps are required:

  1. Create a folder called output, then cd into its parent folder.

  2. Use the following command to remove the tensor.empty() operations (this solves problem 2 of the linked issue):

    docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
                                            mlir-opt \
                                            --canonicalize \
                                            -convert-tensor-to-linalg \
                                            --empty-tensor-to-alloc-tensor \
                                            --eliminate-empty-tensors \
                                            -linalg-bufferize -arith-bufferize \
                                            -tensor-bufferize -func-bufferize \
                                            -finalizing-bufferize -buffer-deallocation \
                                            --buffer-results-to-out-params \
                                            --canonicalize -cse output/pygcn.mlir \
                                                2>&1 | cat > output/01searched-edited.mlir
  3. Modify the just-created file "01searched-edited.mlir" in the following way:

  4. (new) This modified version is now ready to be used with soda-opt. For these experiments I am going to use the optimized version, in order to reduce the number of cycles as much as possible. This is the part subject to change, because I am going to try different soda-opt optimizations to improve the performance of the matmul operation. Inspiration is taken from the article How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance.
    docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
                         soda-opt \
                           -soda-outline-bambu-code \
                           -soda-extract-arguments-to-xml=using-bare-ptr \
                           -soda-generate-bambu-accelcode \
                           -soda-opt-pipeline-for-bambu="no-buffer-trick use-bare-ptr-memref-call-conv number-of-full-unrolls=1" \
                           -mlir-print-ir-after-all \
                           output/01searched-edited.mlir \
                           -o output/04optimized.mlir \
                           2>&1 | cat > output/05intermediate-optimized.mlir
  5. Run the following command to obtain the .ll file:
    docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
                              mlir-translate -opaque-pointers=0 \
                                 --mlir-to-llvmir \
                                 output/04optimized.mlir \
                                 -o output/05optimized.ll

Now the .ll file can be found in the output directory.

  6. Once the .ll file has been created, it is possible to run Bambu. To do so, cd into the parent directory of output and execute the run-bambu.sh script with sh run-bambu.sh optimized (download it and rename it to .sh, since this file type is not supported in GitHub issue comments). The script I am going to use has been adapted to run inside the SODA Docker container using the Bambu clang16 AppImage; a sketch is shown below.
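
For reference, here is a minimal sketch of what such a wrapper could look like, assuming the Bambu flags reported in a later comment of this thread; the variant argument and the input.ll staging step are illustrative assumptions, not the exact script:

#!/bin/sh
# Hypothetical wrapper (run-bambu.sh): stages the chosen .ll file and runs
# the Bambu AppImage inside the SODA Docker container.
VARIANT=${1:?usage: sh run-bambu.sh <variant>}   # e.g. "optimized"
cp "output/05${VARIANT}.ll" input.ll             # assumed staging step
docker run -u "$(id -u)" -v "$(pwd)":/working_dir --rm agostini01/soda \
  sh -c "export APPIMAGE_EXTRACT_AND_RUN=1 && \
         ./bambu-ac_types-clang16.AppImage --simulate --simulator=VERILATOR \
           --top-fname=forward_kernel //working_dir/input.ll"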

directories structure

Here is how the directory structure needs to look before the execution of step 3, including the required initial files.
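
Based on the file names used in the commands above, the initial layout should look roughly like this (a reconstruction; the AppImage placement follows the wrapper sketch above):

.
├── bambu-ac_types-clang16.AppImage
├── run-bambu.sh
└── output
    └── pygcn.mlir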

Here, instead, is how the final directory structure is expected to look.
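
Again reconstructing from the file names used in the commands of this thread, after all steps the directory is expected to contain at least:

.
├── bambu-ac_types-clang16.AppImage
├── run-bambu.sh
├── forward_kernel_test.xml
├── input.ll
├── bambu-log
└── output
    ├── pygcn.mlir
    ├── 01searched-edited.mlir
    ├── 04optimized.mlir
    ├── 05intermediate-optimized.mlir
    └── 05optimized.ll

plus the Verilog and report files that Bambu writes into the working directory.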

dmsgnn commented 1 year ago

number of cycles estimated

The matmul operation computes a multiplication between two matrices of sizes 2708x1433 and 1433x16, resulting in a new matrix of size 2708x16.

Each entry of the new matrix requires 1433 multiplications and 1432 additions. Each multiplication requires 2 cycles and each addition requires 5 cycles.

The total number of cycles is therefore estimated to be (1433*2 + 1432*5) * (2708*16) = 434,406,528.
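
This estimate is easy to double-check in the shell, using the per-operation latencies assumed above:

# cycles per output entry: 1433 muls at 2 cycles + 1432 adds at 5 cycles,
# multiplied by the 2708*16 entries of the result matrix
echo $(( (1433*2 + 1432*5) * 2708*16 ))   # prints 434406528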

dmsgnn commented 1 year ago

result 1

The flow described in the how to run above, without changes to the soda-opt call, gave the following result:

Reading of vector values from input file completed. Simulation started.
Simulation not completed into 200000000 cycles
Start reading vector           1's values from input file.

Reading of vector values from input file completed. Simulation started.
Simulation not completed into 200000000 cycles
File "/tmp/appimage_extracted_5e2bed6bce9f374a789f6866ea86a2ff/usr/results.txt" opened
error -> Expected a number of cycles different from zero. Something wrong happened during the simulation!

execution time -> 17h 17m 10s

result 2

The original soda-opt flow, but with two full unrolls (number-of-full-unrolls=2), gave the following result:

/tmp/appimage_extracted_5e2bed6bce9f374a789f6866ea86a2ff/usr/bin//tool_select.sh: line 13:    11 Killed                  $BINARY_PATH "$@"
cp: cannot stat '/tmp/appimage_extracted_5e2bed6bce9f374a789f6866ea86a2ff/usr/bambu_results_0.xml': No such file or directory

soda-opt execution time -> 5h 47m
Bambu execution time -> 25h 40m

outcome

Unfortunately, the results have not been successful so far. The next step is to run experiments on a subset of the Cora dataset in order to understand how the matmul behaves and how it can be optimized. Then, an option could be to synthesize the accelerator using this subset, work on it, and then describe the limitations encountered and the future work needed to make the accelerator-creation flow possible for all types of GNNs.

dmsgnn commented 1 year ago

parallel experiment

In order to effectively understand the impact of some optimization passes, a parallel experiment has been conducted. In particular, the same Graph Convolutional Network has been run using a subset of the Cora dataset. This subset takes into consideration only 15 nodes, resulting in a $15 \times 15$ adjacency matrix. The matrix multiplication I am trying to accelerate thus changed from a $2708 \times 1433$ by $1433 \times 16$ product to a $15 \times 15$ by $15 \times 16$ one.

The new estimated number of cycles is $15 \cdot 2 + 14 \cdot 5 = 100$ cycles for each entry of the new matrix. The resulting matrix has size $15 \times 16$, for a total of $240$ entries. The total estimated cycle count is therefore $240 \cdot 100 = 24000$.
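
The same shell check as in the earlier comment confirms this number:

# cycles per output entry: 15 muls at 2 cycles + 14 adds at 5 cycles,
# multiplied by the 15*16 = 240 entries of the result matrix
echo $(( (15*2 + 14*5) * 15*16 ))   # prints 24000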

This matmul operation has been synthesized using soda-opt and Bambu in order to investigate the effective number of cycles needed.

| structure | # cycles | % reduction vs baseline |
| --- | --- | --- |
| baseline | 33000 | - |
| opt full unroll 1 | 15900 | 52% |
| opt full unroll 2 | 1230 | 96% |
| opt partial unroll 1 | | |
| opt partial unroll 2 | | |

dot analysis and considerations

See the experiments spreadsheet for the full list and considerations.

dmsgnn commented 1 year ago

strange behaviour

When using the following command

docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
                                                    soda-opt \
                                                    -soda-outline-bambu-code \
                                                    -soda-extract-arguments-to-xml=using-bare-ptr \
                                                    -soda-generate-bambu-accelcode=no-aa \
                                                    -convert-linalg-to-affine-loops \
                                                    --affine-loop-unroll="unroll-num-reps=1" \
                                                    -lower-all-to-llvm=use-bare-ptr-memref-call-conv \
                                                    -mlir-print-ir-after-all \
                                                    output/01searched-edited.mlir \
                                                    -o output/04optimized.mlir \
                                                    2>&1 | cat > output/05intermediate-optimized.mlir

The MLIR code changes in an unexpected way, from this:

func.func @forward_kernel(%arg0: memref<15x15xf32>, %arg1: memref<15x16xf32>, %arg2: memref<15x16xf32>) {
  cf.br ^bb1
^bb1:  // pred: ^bb0
  affine.for %arg3 = 0 to 15 {
    affine.for %arg4 = 0 to 16 {
      affine.for %arg5 = 0 to 15 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<15x15xf32>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<15x16xf32>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %3 = arith.mulf %0, %1 : f32
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<15x16xf32>
      }
    }
  }
  return
}

to this:

func.func @forward_kernel(%arg0: memref<15x15xf32>, %arg1: memref<15x16xf32>, %arg2: memref<15x16xf32>) {
  cf.br ^bb1
^bb1:  // pred: ^bb0
  affine.for %arg3 = 0 to 15 {
    affine.for %arg4 = 0 to 16 {
      affine.for %arg5 = 0 to 12 step 4 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<15x15xf32>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<15x16xf32>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %3 = arith.mulf %0, %1 : f32
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<15x16xf32>
        %5 = affine.apply affine_map<(d0) -> (d0 + 1)>(%arg5)
        %6 = affine.load %arg0[%arg3, %5] : memref<15x15xf32>
        %7 = affine.load %arg1[%5, %arg4] : memref<15x16xf32>
        %8 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %9 = arith.mulf %6, %7 : f32
        %10 = arith.addf %8, %9 : f32
        affine.store %10, %arg2[%arg3, %arg4] : memref<15x16xf32>
        %11 = affine.apply affine_map<(d0) -> (d0 + 2)>(%arg5)
        %12 = affine.load %arg0[%arg3, %11] : memref<15x15xf32>
        %13 = affine.load %arg1[%11, %arg4] : memref<15x16xf32>
        %14 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %15 = arith.mulf %12, %13 : f32
        %16 = arith.addf %14, %15 : f32
        affine.store %16, %arg2[%arg3, %arg4] : memref<15x16xf32>
        %17 = affine.apply affine_map<(d0) -> (d0 + 3)>(%arg5)
        %18 = affine.load %arg0[%arg3, %17] : memref<15x15xf32>
        %19 = affine.load %arg1[%17, %arg4] : memref<15x16xf32>
        %20 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %21 = arith.mulf %18, %19 : f32
        %22 = arith.addf %20, %21 : f32
        affine.store %22, %arg2[%arg3, %arg4] : memref<15x16xf32>
      }
      affine.for %arg5 = 12 to 15 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<15x15xf32>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<15x16xf32>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %3 = arith.mulf %0, %1 : f32
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<15x16xf32>
      }
    }
  }
  return
}

The innermost loop's iterations are split between the first 12 and the last 3. The first 12 iterations have been unrolled with a factor of 4, so the unrolled loop body executes 3 times (due to step 4), while the last three iterations are left as before. My expectation was a loop from 0 to 14 with step 2, plus one residual iteration, since the unroll factor was set to 1. Additionally, changing the unroll factor does not affect the result.

dmsgnn commented 1 year ago

update

The problem in the previous comment was the command I was passing to SODA: --affine-loop-unroll="unroll-num-reps=1" must be changed to -affine-loop-unroll="unroll-factor=2".

Additionally, the unexpected change was due to clang optimizations performed at a lower level. In order to get rid of these optimizations I needed to add -fno-unroll-loops to the Bambu command.

These are the new SODA and Bambu commands that I am going to use to better understand the unrolling optimizations and the bottleneck:

SODA

docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
                                      soda-opt \
                                      -soda-outline-bambu-code \
                                      -soda-extract-arguments-to-xml=using-bare-ptr \
                                      -soda-generate-bambu-accelcode \
                                      -convert-linalg-to-affine-loops \
                                      -affine-loop-unroll="unroll-factor=1" \
                                      -lower-all-to-llvm=use-bare-ptr-memref-call-conv \
                                      -mlir-print-ir-after-all \
                                      output/01searched-edited.mlir \
                                      -o output/04optimized.mlir \
                                      2>&1 | cat > output/05intermediate-optimized.mlir

Bambu

docker run -u $(id -u):$(id -g) -v $(pwd):/working_dir --rm agostini01/soda \
sh -c "export APPIMAGE_EXTRACT_AND_RUN=1 \
&& env NO_CLEANUP=1 ./bambu-ac_types-clang16.AppImage -v3 --print-dot \
-lm --soft-float \
-fno-unroll-loops \
--compiler=I386_CLANG16  \
--device=${DEVICE} \
--clock-period=5 \
--experimental-setup=BAMBU-BALANCED-MP \
--channels-number=32 \
--memory-allocation-policy=NO_BRAM \
--disable-function-proxy \
--generate-tb=//working_dir/forward_kernel_test.xml \
--simulate --simulator=VERILATOR \
--top-fname=forward_kernel \
--output-temporary-directory=//working_dir \
//working_dir/input.ll 2>&1 | tee bambu-log"

dmsgnn commented 1 year ago

GCN accelerator

I tried to synthesize the whole Graph Convolutional Network using the matrix with 15 nodes and it worked. In my opinion, this type of experiment should be the next phase (accelerating the matmul will reduce the number of cycles, but sooner or later there will be an adjacency matrix too large to be computed in fewer than 200 million cycles, so I would not see this upper bound as a limitation).

dmsgnn commented 1 year ago

verilator error

An error occurred in Bambu on the 2708-node matrix with full unroll 1, 32 channels, and NO_BRAM.

1 warning generated.
/tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/bin//tool_select.sh: line 13:   345 Killed                  $BINARY_PATH "$@"
Killed
%Error: Command Failed /usr/bin/verilator_bin --cc --exe --Mdir /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//verilator_beh/verilator_obj -Wno-fatal -Wno-lint -sv \+define\+M32 -CFLAGS -m32 -LDFLAGS -m32\ -lpthread /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//verilator_beh/libtb.so -O3 --output-split-cfuncs 3000 --output-split-ctrace 3000 --x-assign fast --x-initial fast --noassert /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//simulation/bambu_testbench.cpp /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/forward_kernel.v /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//simulation/bambu_testbench.v --top-module bambu_testbench
cp: cannot stat '/tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/bambu_results_0.xml': No such file or directory

vm execution

A strange error occurred in the virtual machine when executing the same command as in the verilator error above:

Ended execution of HLS::WriteHLSSummary:= in 0.77 seconds
Starting execution of HLS::SimulationEvaluation
dirname: missing operand
Try 'dirname --help' for more information.
dirname: missing operand
Try 'dirname --help' for more information.
error -> Returned error code!

Please report bugs to <panda-info@polimi.it>