The matmul operation computes a multiplication between two matrices of sizes 2708x1433 and 1433x16, resulting in a new matrix of size 2708x16.
Each entry of the new matrix requires 1433 multiplications and 1432 additions; each multiplication takes 2 cycles and each addition takes 5 cycles.
The total number of cycles required is therefore estimated to be (1433*2 + 1432*5)*(2708*16) = 434,406,528.
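As a sanity check, this estimate can be reproduced with a short Python snippet (the 2-cycle multiply and 5-cycle add latencies are the assumptions stated above, and `estimate_matmul_cycles` is just an illustrative helper, not part of the flow):

```python
def estimate_matmul_cycles(m, k, n, mul_cycles=2, add_cycles=5):
    """Estimated cycles for an (m x k) @ (k x n) matmul: each of the
    m*n output entries needs k multiplies and k-1 additions."""
    per_entry = k * mul_cycles + (k - 1) * add_cycles
    return per_entry * m * n

print(estimate_matmul_cycles(2708, 1433, 16))  # 434406528
```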
Running the "how to run" procedure (see the description below), without changes to the soda-opt call, gave the following result:
```
Reading of vector values from input file completed. Simulation started.
Simulation not completed into 200000000 cycles
Start reading vector 1's values from input file.
Reading of vector values from input file completed. Simulation started.
Simulation not completed into 200000000 cycles
File "/tmp/appimage_extracted_5e2bed6bce9f374a789f6866ea86a2ff/usr/results.txt" opened
error -> Expected a number of cycles different from zero. Something wrong happened during the simulation!
execution time -> 17h 17m 10s
```
This is consistent with the estimate above: the expected 434,406,528 cycles exceed the simulator's 200,000,000-cycle limit.
Running the original "how to run" soda-opt flow, with 2 full unrolls (`number-of-full-unrolls=2`), gave the following result:
```
/tmp/appimage_extracted_5e2bed6bce9f374a789f6866ea86a2ff/usr/bin//tool_select.sh: line 13: 11 Killed $BINARY_PATH "$@"
cp: cannot stat '/tmp/appimage_extracted_5e2bed6bce9f374a789f6866ea86a2ff/usr/bambu_results_0.xml': No such file or directory
```
soda execution time -> 5h 47m
Bambu execution time -> 25h 40m
Unfortunately, the results so far have not been successful. The next step is to run experiments on a subset of the Cora dataset, in order to understand how the matmul behaves and how it can be optimized. An option would then be to synthesize the accelerator on the dataset subset, work on it, and finally describe the limitations encountered and the future work needed to make the accelerator-creation flow possible for all types of GNNs.
In order to effectively understand the impact of some optimization passes, a parallel experiment has been conducted: the same Graph Convolutional Network has been run on a subset of the Cora dataset.
This subset considers only 15 nodes, resulting in a $15 \times 15$ adjacency matrix. The matrix multiplication I am trying to accelerate therefore changed from $(2708 \times 1433)\times(1433 \times 16)$ to $(15 \times 15)\times(15 \times 16)$.
The new estimated cost is $15 \cdot 2 + 14 \cdot 5 = 100$ cycles for each entry of the new matrix. The resulting matrix has size $15 \times 16$, for a total of $240$ entries. Finally, the total number of cycles is estimated as $240 \cdot 100 = 24{,}000$.
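A minimal check of this arithmetic, with the same assumed latencies as above:

```python
per_entry = 15 * 2 + 14 * 5    # 100 cycles per output entry
total = per_entry * (15 * 16)  # 240 entries in the 15x16 result
print(total)                   # 24000
```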
This matmul operation has been synthesized using soda-opt and Bambu in order to investigate the effective number of cycles needed.
| structure | # cycles | % scale |
|---|---|---|
| baseline | 33000 | - |
| opt full unrl1 | 15900 | 52% |
| opt full unrl2 | 1230 | 96% |
| opt partl unrl1 | | |
| opt partl unrl2 | | |
See the experiments spreadsheet for the full list and considerations.
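For clarity, the `% scale` column above is the reduction relative to the baseline cycle count; a minimal sketch of how those percentages follow from the table values:

```python
baseline = 33000
for name, cycles in [("opt full unrl1", 15900), ("opt full unrl2", 1230)]:
    print(f"{name}: {1 - cycles / baseline:.0%}")  # 52%, 96%
```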
When using the following command
```sh
docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
  -soda-outline-bambu-code \
  -soda-extract-arguments-to-xml=using-bare-ptr \
  -soda-generate-bambu-accelcode=no-aa \
  -convert-linalg-to-affine-loops \
  --affine-loop-unroll="unroll-num-reps=1" \
  -lower-all-to-llvm=use-bare-ptr-memref-call-conv \
  -mlir-print-ir-after-all \
  output/01searched-edited.mlir \
  -o output/04optimized.mlir \
  2>&1 | cat > output/05intermediate-optimized.mlir
```
the MLIR code changes in an unexpected way, from this
```mlir
func.func @forward_kernel(%arg0: memref<15x15xf32>, %arg1: memref<15x16xf32>, %arg2: memref<15x16xf32>) {
  cf.br ^bb1
^bb1:  // pred: ^bb0
  affine.for %arg3 = 0 to 15 {
    affine.for %arg4 = 0 to 16 {
      affine.for %arg5 = 0 to 15 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<15x15xf32>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<15x16xf32>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %3 = arith.mulf %0, %1 : f32
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<15x16xf32>
      }
    }
  }
  return
}
```
to this
```mlir
func.func @forward_kernel(%arg0: memref<15x15xf32>, %arg1: memref<15x16xf32>, %arg2: memref<15x16xf32>) {
  cf.br ^bb1
^bb1:  // pred: ^bb0
  affine.for %arg3 = 0 to 15 {
    affine.for %arg4 = 0 to 16 {
      affine.for %arg5 = 0 to 12 step 4 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<15x15xf32>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<15x16xf32>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %3 = arith.mulf %0, %1 : f32
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<15x16xf32>
        %5 = affine.apply affine_map<(d0) -> (d0 + 1)>(%arg5)
        %6 = affine.load %arg0[%arg3, %5] : memref<15x15xf32>
        %7 = affine.load %arg1[%5, %arg4] : memref<15x16xf32>
        %8 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %9 = arith.mulf %6, %7 : f32
        %10 = arith.addf %8, %9 : f32
        affine.store %10, %arg2[%arg3, %arg4] : memref<15x16xf32>
        %11 = affine.apply affine_map<(d0) -> (d0 + 2)>(%arg5)
        %12 = affine.load %arg0[%arg3, %11] : memref<15x15xf32>
        %13 = affine.load %arg1[%11, %arg4] : memref<15x16xf32>
        %14 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %15 = arith.mulf %12, %13 : f32
        %16 = arith.addf %14, %15 : f32
        affine.store %16, %arg2[%arg3, %arg4] : memref<15x16xf32>
        %17 = affine.apply affine_map<(d0) -> (d0 + 3)>(%arg5)
        %18 = affine.load %arg0[%arg3, %17] : memref<15x15xf32>
        %19 = affine.load %arg1[%17, %arg4] : memref<15x16xf32>
        %20 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %21 = arith.mulf %18, %19 : f32
        %22 = arith.addf %20, %21 : f32
        affine.store %22, %arg2[%arg3, %arg4] : memref<15x16xf32>
      }
      affine.for %arg5 = 12 to 15 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<15x15xf32>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<15x16xf32>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<15x16xf32>
        %3 = arith.mulf %0, %1 : f32
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<15x16xf32>
      }
    }
  }
  return
}
```
The innermost loop's iterations are split between the first 12 and the last 3. The first 12 iterations have been unrolled with a factor of 4, so the loop executes 3 times (due to step 4), while the last three iterations remain as before. My expectation was to have a loop from 0 to 14 with step 2, plus the last iteration, given that the unroll factor was set to 1. Additionally, changing the unrolling factor does not affect the result.
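For reference, here is a minimal Python sketch (illustrative only, not part of the flow) of the structure the pass produced: the reduction loop unrolled by 4 over the first 12 iterations, plus a 3-iteration epilogue.

```python
import numpy as np

def matmul_unrolled_by_4(a, b, c):
    """c += a @ b with the innermost (reduction) loop unrolled by 4,
    mirroring the MLIR above: a main loop over k = 0, 4, 8 and a
    scalar epilogue for the remaining k = 12, 13, 14."""
    m, _ = a.shape  # 15 x 15
    _, n = b.shape  # 15 x 16
    for i in range(m):
        for j in range(n):
            for k in range(0, 12, 4):  # body replicated 4 times per step
                c[i, j] += a[i, k] * b[k, j]
                c[i, j] += a[i, k + 1] * b[k + 1, j]
                c[i, j] += a[i, k + 2] * b[k + 2, j]
                c[i, j] += a[i, k + 3] * b[k + 3, j]
            for k in range(12, 15):    # epilogue: 15 % 4 = 3 iterations
                c[i, j] += a[i, k] * b[k, j]
    return c

a = np.random.rand(15, 15).astype(np.float32)
b = np.random.rand(15, 16).astype(np.float32)
c = np.zeros((15, 16), dtype=np.float32)
assert np.allclose(matmul_unrolled_by_4(a, b, c), a @ b, atol=1e-4)
```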
The problem with the previous issue was the command I was passing to SODA: `--affine-loop-unroll="unroll-num-reps=1"` must be changed to `-affine-loop-unroll="unroll-factor=2"`.
Additionally, the unexpected changes were due to clang optimizations applied at a lower level. In order to get rid of these optimizations I needed to pass `-fno-unroll-loops` to the Bambu command.
These are the new SODA and Bambu commands that I am going to use to better understand the unrolling optimizations and bottlenecks.
SODA
```sh
docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
  -soda-outline-bambu-code \
  -soda-extract-arguments-to-xml=using-bare-ptr \
  -soda-generate-bambu-accelcode \
  -convert-linalg-to-affine-loops \
  -affine-loop-unroll="unroll-factor=1" \
  -lower-all-to-llvm=use-bare-ptr-memref-call-conv \
  -mlir-print-ir-after-all \
  output/01searched-edited.mlir \
  -o output/04optimized.mlir \
  2>&1 | cat > output/05intermediate-optimized.mlir
```
Bambu
```sh
docker run -u $(id -u):$(id -g) -v $(pwd):/working_dir --rm agostini01/soda \
  sh -c "export APPIMAGE_EXTRACT_AND_RUN=1 \
  && env NO_CLEANUP=1 ./bambu-ac_types-clang16.AppImage -v3 --print-dot \
  -lm --soft-float \
  -fno-unroll-loops \
  --compiler=I386_CLANG16 \
  --device=${DEVICE} \
  --clock-period=5 \
  --experimental-setup=BAMBU-BALANCED-MP \
  --channels-number=32 \
  --memory-allocation-policy=NO_BRAM \
  --disable-function-proxy \
  --generate-tb=//working_dir/forward_kernel_test.xml \
  --simulate --simulator=VERILATOR \
  --top-fname=forward_kernel \
  --output-temporary-directory=//working_dir \
  //working_dir/input.ll 2>&1 | tee bambu-log"
```
I tried to synthesize the whole Graph Convolutional Network using the matrix with 15 nodes and it worked. This type of experiment, imo, should be the next phase: accelerating the matmul will reduce the number of cycles, but sooner or later there will be an adjacency matrix too large to be computed in less than 200 million cycles, so I would not see this upper bound as a limitation.
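To make the upper bound concrete: under the same 2-cycle mul / 5-cycle add model used above, and assuming the feature and output dimensions stay fixed at 1433 and 16, a rough estimate of the largest node count whose feature-times-weight matmul still fits the simulator's 200,000,000-cycle budget:

```python
def estimate_matmul_cycles(m, k, n, mul_cycles=2, add_cycles=5):
    return (k * mul_cycles + (k - 1) * add_cycles) * m * n

budget = 200_000_000
nodes = 1
while estimate_matmul_cycles(nodes + 1, 1433, 16) <= budget:
    nodes += 1
print(nodes)  # roughly 1246 nodes under this simple model
```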
An error occurred in Bambu on the 2708 matrix with full unroll 1, 32 channels, and NO_BRAM:
```
1 warning generated.
/tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/bin//tool_select.sh: line 13: 345 Killed $BINARY_PATH "$@"
Killed
%Error: Command Failed /usr/bin/verilator_bin --cc --exe --Mdir /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//verilator_beh/verilator_obj -Wno-fatal -Wno-lint -sv \+define\+M32 -CFLAGS -m32 -LDFLAGS -m32\ -lpthread /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//verilator_beh/libtb.so -O3 --output-split-cfuncs 3000 --output-split-ctrace 3000 --x-assign fast --x-initial fast --noassert /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//simulation/bambu_testbench.cpp /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/forward_kernel.v /tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/HLS_output//simulation/bambu_testbench.v --top-module bambu_testbench
cp: cannot stat '/tmp/appimage_extracted_dec98ca795e7b679a3c81ccbbeee9016/usr/bambu_results_0.xml': No such file or directory
```
A strange error occurred in the virtual machine when executing the same command as in the Verilator error above:
```
Ended execution of HLS::WriteHLSSummary:= in 0.77 seconds
Starting execution of HLS::SimulationEvaluation
dirname: missing operand
Try 'dirname --help' for more information.
dirname: missing operand
Try 'dirname --help' for more information.
error -> Returned error code!
Please report bugs to <panda-info@polimi.it>
```
### description
This issue aims to record the list of experiments, and their results, performed on the matmul operation of the Graph Convolutional Network implemented in PyTorch. The operation in question is the following one:
https://github.com/dmsgnn/master-thesis/blob/923ff3413b394fe438c4c16f362bb61a42a1f1af/pygcn/soda_gc1/output/01searched-edited.mlir#L23
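In PyTorch terms (shapes taken from the estimate above; the variable names are illustrative), the operation corresponds to the first layer's feature-times-weight product:

```python
import torch

x = torch.randn(2708, 1433)  # Cora node features
w = torch.randn(1433, 16)    # first-layer weight matrix
out = x @ w                  # 2708 x 16: the matmul being accelerated
```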
### how to run
The first passes performed, starting from the initial pygcn.mlir, are the same as in https://github.com/dmsgnn/gnn-acceleration-master-thesis/issues/3#issuecomment-1553001773. The whole new process is reported here for completeness.
After having run the `pygcn` model using the `train.py` file, and having correctly saved the `pygcn.mlir` file, the following procedures are required:

1. Create a folder called `output`, then `cd` into its parent folder.
2. Use the following command to remove the `tensor.empty()` procedures (this solves problem 2).
3. Modify the just-created file `01searched-edited.mlir` by changing `ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>` to `memref.global "private" @global_seed : memref<i64> = dense<0>` (this solves problem 1).
4. Create the `.ll` file; it can then be found in the `output` directory.
5. With the `.ll` file, it is possible to run Bambu. To do so, we can `cd` into the parent directory of `output` and execute the `run-bambu.sh` script using `sh run-bambu.sh optimized` (download it and convert it to `.sh`; this file type is not supported by GitHub issue comments). The script I am going to use has been adapted to run inside the SODA Docker using the Bambu clang16 AppImage.

### directories structure
Here is reported how the directory structure needs to be before the execution of step 3, including the required initial files.
Here, instead, is reported how the final directory structure is expected to look.