doonny / PipeCNN

An OpenCL-based FPGA Accelerator for Convolutional Neural Networks
Apache License 2.0

Hardware resource usage becomes extremely large when VEC_SIZE=16, LANE_NUM=64 #56

Closed aazz44ss closed 6 years ago

aazz44ss commented 6 years ago

When I set VEC_SIZE=8, LANE_NUM=64, the resource usage is:

| ALUTs | FFs | RAMs | DSPs |
| --- | --- | --- | --- |
| 157019 (20%) | 158729 (10%) | 1103 (44%) | 292 (19%) |

The area report for the MAC function

    int mult_add_fix8bx4 (char a0_in, char b0_in, char a1_in, char b1_in, char a2_in, char b2_in, char a3_in, char b3_in) {return (a0_in*b0_in)+(a1_in*b1_in)+(a2_in*b2_in)+(a3_in*b3_in);}

shows:

| Operation | ALUTs | FFs | RAMs | DSPs |
| --- | --- | --- | --- | --- |
| 32-bit Integer Add (x640) | 14362 | 0 | 0 | 0 |
| 32-bit Integer Multiply (x512) | 0 | 0 | 0 | 256 |

This looks correct, because I have 512 multiplies (8 * 64), and two multiplies share one DSP.

But if I set VEC_SIZE=16, LANE_NUM=64, the resource usage is:

| ALUTs | FFs | RAMs | DSPs |
| --- | --- | --- | --- |
| 819702 (104%) | 622573 (40%) | 1425 (56%) | 386.5 (25%) |

The ALUT and FF usage becomes extremely large. The area report for the MAC function

    int mult_add_fix8bx4 (char a0_in, char b0_in, char a1_in, char b1_in, char a2_in, char b2_in, char a3_in, char b3_in) {return (a0_in*b0_in)+(a1_in*b1_in)+(a2_in*b2_in)+(a3_in*b3_in);}

now shows:

| Operation | ALUTs | FFs | RAMs | DSPs |
| --- | --- | --- | --- | --- |
| 32-bit Integer Add (x9363) | 307289 | 0 | 0 | 0 |
| 32-bit Integer Multiply (x755) | 0 | 0 | 0 | 377.5 |
| And (x8608) | 94688 | 0 | 0 | 0 |

The "Add" hardware usage becomes very large, and there are a lot of "And" operations. The DSP usage should be 16*64/2 = 1024/2 = 512, but the function only uses 377.5. Is the compiler using "Add" and "And" logic to build the remaining 512-377.5 = 134.5 DSPs' worth of multipliers?

Is there a limit on how large VEC_SIZE * LANE_NUM can be, beyond which the compiler implements the extra multipliers in ALUTs?

Or should I use an RTL IP if I want to set VEC_SIZE=16, LANE_NUM=64?
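
The DSP arithmetic in the comment above can be checked with a short C sketch. It assumes, as the post does, that each vector element in each lane needs one 8-bit multiplier and that two multipliers share one DSP block. The function names are hypothetical helpers for illustration and are not part of PipeCNN or the OpenCL SDK.

```c
#include <assert.h>

/* Multipliers needed per cycle: one per vector element per lane.
 * Assumption taken from the post, not from the compiler documentation. */
int expected_multipliers(int vec_size, int lane_num) {
    assert(vec_size > 0 && lane_num > 0);
    return vec_size * lane_num;
}

/* Expected DSP blocks when mults_per_dsp multipliers share one DSP
 * (the post assumes 2). */
double expected_dsps(int vec_size, int lane_num, int mults_per_dsp) {
    assert(mults_per_dsp > 0);
    return (double)expected_multipliers(vec_size, lane_num) / mults_per_dsp;
}

/* DSP blocks missing from the report relative to the expectation.
 * A positive result suggests some multipliers were built from logic. */
double dsp_shortfall(int vec_size, int lane_num, int mults_per_dsp,
                     double reported_dsps) {
    return expected_dsps(vec_size, lane_num, mults_per_dsp) - reported_dsps;
}
```

With the figures reported above, `dsp_shortfall(8, 64, 2, 256.0)` gives 0, while `dsp_shortfall(16, 64, 2, 377.5)` gives 134.5, which matches the gap the post asks about.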

doonny commented 6 years ago

So far, VEC_SIZE and LANE_NUM cannot be set too large. It has something to do with the compiler's routing.

aazz44ss commented 6 years ago

My device is an Arria 10, and I use OpenCL SDK 17.0 and 17.1. Is this a bug in the Altera OpenCL SDK? I expected that doubling VEC_SIZE would double the DSP usage and the "Add" count as well. However, the "Add" usage becomes extremely large, and the DSP usage is abnormal.

myih commented 6 years ago

Hi @aazz44ss, I'm testing PipeCNN on the Arria 10 GX kit. The resource usage is quite different from yours and the paper's: only ~2% of the DSPs are used, and I can't find anything like "32-bit or 8-bit Integer Multiply" in the area analysis in report.html, only "32-bit Integer Add (x383)". Is it because report.html doesn’t show the HDL part? If so, where can I find the real hardware usage?

Also, what runtime did you achieve with (vector=16 lane=32)? Mine is 14ms (vector=16 lane=32) and 11ms (vector=16 lane=64) for AlexNet.

Thank you

aazz44ss commented 6 years ago

Maybe it doesn't show the HDL part. The real hardware usage is in top_report or acl_quartus_report. The report shows the hardware usage and the kernel frequency as well. You can use the kernel frequency, vector size, lane number, and the model's MAC count to calculate the execution time.
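
One way to read the calculation suggested above is the ideal-case estimate below. This is a minimal sketch under a stated assumption that does not come from the thread: every one of the VEC_SIZE * LANE_NUM multipliers completes one MAC per kernel clock cycle, with no memory or pipeline stalls. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Ideal-case execution time in milliseconds.
 * Assumption (not from the thread): all vec_size * lane_num multipliers
 * complete one MAC per kernel clock cycle, with no stalls, so the result
 * is a lower bound rather than a prediction. */
double estimate_exec_time_ms(double total_macs, int vec_size, int lane_num,
                             double kernel_freq_mhz) {
    assert(total_macs >= 0.0);
    assert(vec_size > 0 && lane_num > 0);
    assert(kernel_freq_mhz > 0.0);

    double macs_per_cycle = (double)vec_size * (double)lane_num;
    double cycles = total_macs / macs_per_cycle;          /* clock cycles needed */
    double seconds = cycles / (kernel_freq_mhz * 1.0e6);  /* MHz to Hz */
    return seconds * 1.0e3;                               /* seconds to ms */
}
```

For a made-up model with 1.024e9 MACs, VEC_SIZE=16, LANE_NUM=64 and a 200 MHz kernel clock (illustrative numbers only), the estimate is about 5 ms. Measured runtimes will normally be higher, since memory bandwidth and layers that do not fill every lane are ignored here.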

doonny commented 6 years ago

Check the resource utilization in Quartus, not in report.html.

manojrohit commented 6 years ago

Hey @myih, how long did it take you to compile the project into a binary for the Arria 10 (vector=16, lane=64)?