doonny / PipeCNN

An OpenCL-based FPGA Accelerator for Convolutional Neural Networks
Apache License 2.0

PipeCNN Performance #103

Open saman-aghazadeh opened 5 years ago

saman-aghazadeh commented 5 years ago

Hi,

Got a question about PipeCNN's performance numbers. In my experience, I get around 300 ms/image running VGG-16 on PipeCNN. Another paper that follows essentially the same approach as PipeCNN, called Caffeine, claims to get 20 ms/image. Other work, called DLA, claims 1 ms/image for AlexNet, while I get 22 ms/image. How are all of these possible? What exactly are they doing that PipeCNN does not? Is it possible that what they claim is wrong?

Thanks

aazz44ss commented 5 years ago
  1. more DSPs,
  2. higher Fmax,
  3. better algorithms, like Winograd

saman-aghazadeh commented 5 years ago

So, based on the PipeCNN design, how can you massively increase the number of DSPs used? The only place that uses DSPs is the core convolution, and the parameters that can improve it are VEC_SIZE and LANE_NUM. On Arria 10, I can hardly reach VEC_SIZE=16 and LANE_NUM=32 due to local memory limitations.

With respect to Fmax, other papers achieve around 200 MHz, while we reach 230 MHz. As for Winograd, papers like Caffeine are not using it, and I don't think something like Winograd alone can boost performance 15 times.

I'm somewhat worried that what they claim is not correct, and since they won't share the code, there is no way to check their results.

aazz44ss commented 5 years ago

You only use 16x32 = 512 DSPs, while Arria 10 can have 3036 DSPs, and the remaining ALMs can also be used as multipliers. So if someone uses all the DSPs, he is 6 times faster. If he also uses Winograd F(4,3), he gains an additional 2x. If he uses INT6 as the DSP input, he gains another 2x. 6x2x2 = 24, at the same Fmax.

doonny commented 5 years ago

No, don't say that. If you read the papers carefully, you will see how they got those numbers. One more thing: a throughput of 1000 images/s does not mean that one image is processed in 1 ms. Throughput is different from "speed" or latency. Be careful with the terms used to measure performance.

aazz44ss commented 5 years ago

You can also check DLA with Arria 10 on OpenVINO; the throughput is even higher than in the 2017 paper. And I don't think local memory is a limitation if the design is carefully optimized.

saman-aghazadeh commented 5 years ago

I totally understand the difference between throughput and latency. My confusion is that PipeCNN is the only open-source CNN accelerator for FPGAs, and the performance gap with other related works is huge. So my question is: are there fundamental differences between the optimized CNN designs on FPGAs and what we have with PipeCNN? Again, based on my understanding, PipeCNN and Caffeine are very similar, while the performance difference for one image is huge (Caffeine reports numbers with a batch size of 1). I would just like to know what path I should take in optimizing PipeCNN so that I can get somewhere close to state-of-the-art performance.

saman-aghazadeh commented 5 years ago

Also, one more thing: the DSP utilization is really low, and to increase it we would have to make major changes to how PipeCNN performs its parallel computation. In other papers, DSP utilization is quite high. Since they only cover the fundamental design details of their work, it's not clear how their code achieves such high DSP utilization.

aazz44ss commented 5 years ago

Low DSP utilization may be due to high fanout, which increases the effort the fitter has to spend. You can use a systolic array to reduce fanout.

saman-aghazadeh commented 5 years ago

OK. I also had the impression that both DLA and Caffeine use systolic arrays; DLA uses a 1-D array. So one more question (specifically for Caffeine): are they hard-coding their systolic array design, or is it something inferred by the Xilinx compiler? They don't discuss how their pseudo-code (with all the tilings and other optimizations) gets inferred as a systolic array in their design.

doonny commented 5 years ago

DLA is OpenCL-based, except for the FP16 unit. I have seen parts of the code.

saman-aghazadeh commented 5 years ago

So, which parts of PipeCNN should be improved to get performance numbers as close as possible to DLA and Caffeine? I still assume Caffeine is much closer to PipeCNN than DLA is. Could you please guide me on this? I'm interested in improving PipeCNN for the sake of the community, and I would appreciate it if you could point me to some resources that better describe an architecture for faster convolution, something like a systolic array.

doonny commented 5 years ago

My feeling is that the key is to understand the "roofline model". For instance, if you are working on an FPGA that has 256 DSPs, the highest performance you can get is 256 x 2 x 2 x 200 MHz = 204.8 GOP/s, assuming you have utilized all 256 DSPs for computation. If you cannot get performance close to that number, there are two possible reasons: (1) you are not using all the DSPs for computation, or (2) your design is bandwidth-limited, which prevents you from getting close to the computational roof.

If you can identify the problem, then you will at least have an idea of what your optimization goal should be: DSP utilization or bandwidth? However, there is no simple answer for how to do the detailed optimization, since it is closely related to the HW architecture you are using.