EECS-NTNU / bismo

BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing
BSD 3-Clause "New" or "Revised" License

Error on testing test_multibit_multitile in bismo/src/main/resources/cpp/app/BISMOTests.hpp #9

Closed yunchenlo closed 4 years ago

yunchenlo commented 4 years ago

Hello,

First, thank you for your effort in developing the BISMO accelerator and its runtime library. I am seeing some test failures when testing BISMO.

Here is my setup and what I see (I use a PYNQ Z1 board with the v2.4 PYNQ image):

[1-bit config output error] I configure BISMO to run the AlexNet layer sizes (96, 363, 3025), (256, 2400, 729), (384, 2304, 169), (384, 3456, 169), (256, 3456, 169), (4096, 9216, 1), (4096, 4096, 1), and (1000, 4096, 1).

// Following is my code

  cout << "AlexNet START" << endl;
  all_OK &= test("layer1", 96, 363, 3025, 1, 1);
  all_OK &= test("layer2", 256, 2400, 729, 1, 1);
  all_OK &= test("layer3", 384, 2304, 169, 1, 1);
  all_OK &= test("layer4", 384, 3456, 169, 1, 1);
  all_OK &= test("layer5", 256, 3456, 169, 1, 1);
  all_OK &= test("layer6", 4096, 9216, 1, 1, 1);
  all_OK &= test("layer7", 4096, 4096, 1, 1, 1);
  all_OK &= test("layer8", 1000, 4096, 1, 1, 1);
  cout << "AlexNet END" << endl;
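The same layer list can also be kept in a table, so that it can be re-run with other bit widths (for example 8/8 instead of 1/1). This is only a sketch: `Layer`, `alexnet_layers`, and `run_layers` are hypothetical names, the field names follow the `rows depth cols` wording of the interactive prompt, and `test_fn` stands in for whatever test function the test harness actually provides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One matrix multiplication shape. Field names are an assumption based on
// the "rows depth cols" prompt of the interactive test app.
struct Layer {
  std::string name;
  std::size_t rows;
  std::size_t depth;
  std::size_t cols;
};

// The eight shapes from the test calls above, in the same argument order.
inline std::vector<Layer> alexnet_layers() {
  return std::vector<Layer>{
      Layer{"layer1", 96, 363, 3025},  Layer{"layer2", 256, 2400, 729},
      Layer{"layer3", 384, 2304, 169}, Layer{"layer4", 384, 3456, 169},
      Layer{"layer5", 256, 3456, 169}, Layer{"layer6", 4096, 9216, 1},
      Layer{"layer7", 4096, 4096, 1},  Layer{"layer8", 1000, 4096, 1}};
}

// Runs test_fn on every layer and reports whether all of them passed.
// test_fn is a placeholder callable taking
// (name, rows, depth, cols, lhs_bits, rhs_bits) and returning bool.
template <typename TestFn>
bool run_layers(const std::vector<Layer>& layers, std::size_t lhs_bits,
                std::size_t rhs_bits, TestFn test_fn) {
  assert(lhs_bits > 0 && rhs_bits > 0);
  bool all_ok = true;
  for (const Layer& l : layers) {
    // Call the test before combining, so a failure never skips later layers.
    const bool ok =
        test_fn(l.name, l.rows, l.depth, l.cols, lhs_bits, rhs_bits);
    all_ok = all_ok && ok;
  }
  return all_ok;
}
```

With a wrapper around the real test function passed as `test_fn`, `run_layers(alexnet_layers(), 1, 1, test_fn)` would cover the same eight calls as the snippet above.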

Layers (384, 2304, 169), (384, 3456, 169), and (256, 3456, 169) all fail the test. When the run reaches (4096, 9216, 1), the PYNQ stalls and simply disconnects from the host.

[8-bit config output error] When I configure BISMO to run AlexNet with 8 bits, it reports that the RHS is too large and not supported by the current runtime:

Exception: RHS is too large and not currently supported in runtime library.

Could you help fix this issue? For completeness, I think the runtime library should be able to handle configurations from a commonly used network architecture.

Many Thanks, Yun-Chen

maltanar commented 4 years ago

Hi Yun-Chen,

If you are trying this from the master branch there is indeed a limitation on the maximum size of the matrices, due to how the scheduler works. You could try increasing the BRAM size while synthesizing the array using LMEM and RMEM parameters, or you can try the following branch from Johannes Kath and see if it solves your problem:

https://github.com/kathjo/bismo/tree/dynamic_fetch

yunchenlo commented 4 years ago

Hi,

Thank you for your reply! I will try that branch now.

In addition, do you plan to fix the computation mismatch between the CPU and BISMO for the following configs, or should I change the BISMO synthesis configuration? (384, 2304, 169), (384, 3456, 169), (256, 3456, 169)

This is an important part of my current work.

Thank you very much, Yun-Chen

yunchenlo commented 4 years ago

Hi, I tried the branch from Johannes Kath (@kathjo), but it does not pass even the CPU HW-SW cosimulation. Below is part of the log:

[error]  found   : bismo.ResultController
[error]  required: T
[error]   val resultCtrl = Module(new ResultController()).io
[error]                           ^
[error] /home/yclo/Projects/bismo/src/main/scala/BISMO.scala:245: not found: type FPGAQueue
[error]   val fetchOpQ = Module(new FPGAQueue(new BISMOFetchRunInstruction(), myP.cmdQueueEntries)).io
[error]                             ^
[error] /home/yclo/Projects/bismo/src/main/scala/BISMO.scala:246: not found: type FPGAQueue
[error]   val execOpQ = Module(new FPGAQueue(new BISMOExecRunInstruction(), myP.cmdQueueEntries)).io
[error]                            ^
[error] /home/yclo/Projects/bismo/src/main/scala/BISMO.scala:247: not found: type FPGAQueue
[error]   val resultOpQ = Module(new FPGAQueue(new BISMOResultRunInstruction(), myP.cmdQueueEntries)).io
[error]                              ^
[error] /home/yclo/Projects/bismo/src/main/scala/BISMO.scala:250: not found: type AsymPipelinedDualPortBRAM
[error]     Module(new AsymPipelinedDualPortBRAM(
[error]                ^
[error] /home/yclo/Projects/bismo/src/main/scala/BISMO.scala:254: not found: value regIn
[error]       ), regIn = myP.bramPipelineBefore, regOut = myP.bramPipelineAfter
[error]          ^
[error] /home/yclo/Projects/bismo/src/main/scala/BISMO.scala:254: not found: value regOut
[error]       ), regIn = myP.bramPipelineBefore, regOut = myP.bramPipelineAfter
[error]                                          ^
[error] 321 errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 4 s, completed Dec 19, 2019 3:46:06 PM
Makefile:122: recipe for target '/home/yclo/Projects/bismo/build/2x64x2/VerilatedTester/hw/verilog/VerilatedTesterWrapper.v' failed
make: *** [/home/yclo/Projects/bismo/build/2x64x2/VerilatedTester/hw/verilog/VerilatedTesterWrapper.v] Error 1

Is there any other branch I could try?

Thank you, Yun-Chen

maltanar commented 4 years ago

Those errors are because fpga-tidbits is missing; it is a submodule. It looks like you need to make a new clone of that branch with the --recurse-submodules flag, with something like this:

git clone --recurse-submodules --branch dynamic_fetch https://github.com/kathjo/bismo.git

or go into the existing cloned folder and run git submodule init; git submodule update

yunchenlo commented 4 years ago

Thank you !

After running

git submodule init; git submodule update

I am able to compile and run.

I will try my configuration for AlexNet.

yunchenlo commented 4 years ago

I have successfully tested the AlexNet config with 1b weights and 1b activations.

However, although I use the branch from Johannes Kath, I still run into the "not supported in runtime library" problem.

For example:

Enter rows depth cols, 0 to exit 
96 3025 363
Enter lhs and rhs bits: 
8 8
Exception: RHS is too large and not currently supported in runtime library.

Enter rows depth cols, 0 to exit 
256 2400 729
Enter lhs and rhs bits: 
8 4
Exception: LHS is too large and not currently supported in runtime library.

Enter rows depth cols, 0 to exit 
256 2400 729
Enter lhs and rhs bits: 
4 8
Exception: RHS is too large and not currently supported in runtime library

Would increasing the BISMO hardware size solve the problem?

Thank you and Merry X'mas , Yun-Chen

maltanar commented 4 years ago

I would double-check that the runtime library is recompiled from Johannes' fork (e.g. make sure you do a clean deployment and remove any old generated files from the old branch), but other than that I'm a bit puzzled. @kathjo any insight into this?

Increasing the hardware size is definitely something to try, though I would have guessed that your use cases would work with this branch.

(And merry xmas to you too :))

yunchenlo commented 4 years ago

Hi,

An update on this: I found that the previously reported "not supported" exceptions occurred because I had failed to build the dynamic_fetch runtime library. The remaining problem is that only this one configuration still cannot run on 4x128x4 and 6x128x6 hardware:

root@pynq:/home/xilinx/bismo/PYNQZ1# LD_LIBRARY_PATH=$(pwd) ./testapp i
Enter rows depth cols, 0 to exit 
4096 9216 1
Enter lhs and rhs bits: 
4 8
Exception: RHS is too large and not currently supported in runtime library. 

A single RHS stripe (D_n * K * rhs-bits) does not fit into On-Chip-Memory
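
The size expression quoted in the error message can be worked through by hand. The sketch below only evaluates the `D_n * K * rhs-bits` expression from the message and compares it against a capacity that you supply. The helper names, the padding of K to a multiple of D_k, and the capacity value are assumptions made for illustration; this is not the runtime library's actual check.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of m.
inline std::uint64_t round_up(std::uint64_t x, std::uint64_t m) {
  assert(m > 0);
  return ((x + m - 1) / m) * m;
}

// Bits occupied by one RHS stripe, following the expression in the error
// message: D_n * K * rhs_bits. Padding K up to a multiple of D_k is an
// assumption of this sketch, not something the message states.
inline std::uint64_t rhs_stripe_bits(std::uint64_t dn, std::uint64_t dk,
                                     std::uint64_t k,
                                     std::uint64_t rhs_bits) {
  return dn * round_up(k, dk) * rhs_bits;
}

// True if the stripe fits into the given on-chip capacity (in bits).
// How the capacity follows from the synthesis parameters is left to the
// caller; it is a plain input here.
inline bool stripe_fits(std::uint64_t stripe_bits,
                        std::uint64_t capacity_bits) {
  return stripe_bits <= capacity_bits;
}
```

For the case above on 4x128x4 hardware, `rhs_stripe_bits(4, 128, 9216, 8)` gives the stripe footprint in bits. How the runtime derives the available capacity from RMEM is not stated in this thread, so that value has to come from the hardware configuration in use.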

I am trying different hardware configs now. However, 8x256x8 with LMEM and RMEM set to 2048 fails to generate bismo.bit for the PYNQZ1. I am trying other configs such as 6x128x6.

In addition, may I ask whether the reported runtime for a matrix multiplication includes the time for moving the test arrays from the host to BISMO, preparing, computing, moving the results from BISMO back to the host, and comparing? For example:

Enter rows depth cols, 0 to exit 
1000 4096 1
Enter lhs and rhs bits: 
1 2
...
...
Runtime: 87449 cycles, 438007 ns
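
As a quick sanity check on the two figures in a `Runtime:` line, the ratio of nanoseconds to cycles gives the implied clock period. This is a sketch with hypothetical helper names, and it assumes both figures cover the same measured interval, which this thread does not confirm.

```cpp
#include <cassert>

// Average time per cycle in nanoseconds, assuming both numbers describe
// the same measured interval.
inline double ns_per_cycle(double cycles, double ns) {
  assert(cycles > 0.0);
  return ns / cycles;
}

// Implied clock frequency in MHz under the same assumption:
// cycles per nanosecond is GHz, so multiply by 1000 to get MHz.
inline double implied_clock_mhz(double cycles, double ns) {
  assert(ns > 0.0);
  return cycles / ns * 1000.0;
}
```

For the line above, 438007 ns over 87449 cycles is roughly 5 ns per cycle.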

Thanks for your merry X'mas:)

yunchenlo commented 4 years ago

Hi @maltanar ,

After increasing the hardware size to 4x256x4, I was finally able to run 4096x9216x1 with LHS 4b, RHS 8b!

Woohoo! Thank you for your kind help!

Yun-Chen

maltanar commented 4 years ago

@jasonlo0509 glad to hear that! Regarding your question here:

In addition, may I ask whether the reported runtime for a matrix multiplication includes the time for moving the test arrays from the host to BISMO, preparing, computing, moving the results from BISMO back to the host, and comparing?

The answer to this depends on which of the benchmarking modes you are using. It looks like you may be using one of the modes that reports the time for all compute and memory movement (e.g. cpu->fpga, compute, fpga->cpu) but to make sure I would look at the two following files:

To see which benchmark function gets called when you make a selection: https://github.com/kathjo/bismo/blob/dynamic_fetch/src/main/resources/cpp/app/main.cpp#L37

To see what the benchmark functions themselves do: https://github.com/kathjo/bismo/blob/dynamic_fetch/src/main/resources/cpp/app/benchmark.hpp

Note that some of those modes produce a lot more benchmarking detail and let you see how much time e.g. cpu -> fpga data movement took. Having a look here may help you understand how to parse the detailed benchmarking data:

https://github.com/kathjo/bismo/blob/dynamic_fetch/doc/instrumentation.md

yunchenlo commented 4 years ago

I am using LD_LIBRARY_PATH=$(pwd) ./testapp i, which is the interactive benchmarking mode. Thank you for such a professional and detailed answer! I did look into those files before. I will look at them more closely again and may double-check my understanding with you later.

Please feel free to close this issue!

Sincerely, Yun-Chen