doe300 / VC4CL

OpenCL implementation running on the VideoCore IV GPU of the Raspberry Pi models
MIT License
726 stars 79 forks

Explanation for performance gap #91

Open ThomasDebrunner opened 4 years ago

ThomasDebrunner commented 4 years ago

I am curious about performance measurements / theoretical performance numbers. The often-stated theoretical performance of the VideoCore IV is 24 GFLOPS.
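For reference, I assume that figure comes from the usual peak calculation (12 QPUs at the commonly quoted 250 MHz, each retiring 4 SIMD lanes per cycle through an add ALU and a mul ALU; correct me if the numbers are off):

```
12 QPUs × 4 lanes/cycle × 2 FLOPs/lane (add ALU + mul ALU) × 250 MHz = 24 GFLOPS
```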

The author of py-videocore manages to get to 8.32 GFLOPS with hand-optimized code: https://qiita.com/9_ties/items/e0fdd165c1c7df6bb8ee

The fastest claimed measurement with clpeak using VC4CL is also just above 8 GFLOPS. On my Raspberry Pi, I measure about 6.3 GFLOPS.

So even a synthetic benchmark and hand-optimized code can only reach about one third of the theoretical performance. For desktop GPUs, clpeak mostly finds about the same performance as stated by the manufacturer. Where does this large performance gap come from?

pfoof commented 4 years ago

I got at most 13.62 GFLOP/s for a large number of loop iterations in FlopsCL with float16. One of the important aspects is to balance kernel length against the number of iterations.
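The kernels I use look roughly like the sketch below (not the exact FlopsCL code, just the general float16 mad-chain pattern; the kernel name and the iters parameter are made up):

```c
// Sketch of a float16 FLOPS micro-benchmark: each loop pass issues two mad()
// calls, i.e. 2 * 2 * 16 = 64 FLOPs per work-item per iteration.
__kernel void flops_f16(__global float *out, const float seed, const int iters)
{
    float16 x = (float16)(seed);
    float16 y = (float16)(seed + 1.0f);
    for (int i = 0; i < iters; ++i) {
        // mad() is a multiply followed by an add, so it can exercise both ALUs
        // if the compiler manages to pair the operations.
        x = mad(x, y, x);
        y = mad(y, x, y);
    }
    // Reduce to a scalar so the compiler cannot remove the loop as dead code.
    float16 s = x + y;
    out[get_global_id(0)] = s.s0 + s.s1 + s.s2 + s.s3 + s.s4 + s.s5 + s.s6 + s.s7
                          + s.s8 + s.s9 + s.sa + s.sb + s.sc + s.sd + s.se + s.sf;
}
```

Unrolling the loop more reduces branch overhead, but also grows the kernel binary, which is where the kernel-length/iteration trade-off shows up.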

I have already done many measurements, but will publish them in October at the latest.

doe300 commented 4 years ago

So one big factor is the ALUs. You only get the full 24 GFLOPS if you utilize both ALUs in every clock cycle! Since the multiplication ALU does not have that many opcodes, it is definitely not utilized that much.
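To illustrate what this means for kernel code (just a sketch; whether VC4C actually pairs the operations this way depends on the generated code):

```c
// Additions can only go to the add ALU, so this kernel is capped at half the
// peak no matter how well everything else is scheduled.
__kernel void add_only(__global float16 *data, const int iters)
{
    float16 x = data[get_global_id(0)];
    for (int i = 0; i < iters; ++i)
        x = x + x;           // mul ALU idles every cycle
    data[get_global_id(0)] = x;
}

// A multiply and an add can in principle be packed into a single instruction,
// using both ALUs and getting closer to the 2 FLOPs per lane per cycle peak.
__kernel void mul_add(__global float16 *data, const int iters)
{
    float16 x = data[get_global_id(0)];
    float16 y = x + (float16)(1.0f);
    for (int i = 0; i < iters; ++i)
        x = x * y + y;       // multiply on the mul ALU, add on the add ALU (if paired)
    data[get_global_id(0)] = x;
}
```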

And of course the other problem will be memory bandwidth. Compared to the fairly powerful computational capability, the memory interfaces are very slow.
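A rough back-of-the-envelope number for a streaming kernel that does one FLOP per float it reads (the SDRAM figure is only an order-of-magnitude guess, and that bandwidth is shared with the CPU anyway):

```
bandwidth needed at peak compute ≈ 24 GFLOP/s × 4 B/float = 96 GB/s
bandwidth actually available     ≈ a few GB/s
```

So anything that fetches most of its operands from memory is bandwidth-bound long before the ALUs become the limit.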

And as @pfoof hinted (I think), too large a kernel (or branches skipping over too many instructions) might also lead to cache misses when loading the instructions. But I don't have any numbers for that.

pfoof commented 3 years ago

Hey @doe300, I couldn't find any other way to contact you, and I would like to share my master's thesis research on VC4CL: https://www.researchgate.net/publication/346000679_Performance-energy_energy_benchmarking_of_selected_parallel_programming_platforms_with_OpenCL

doe300 commented 3 years ago

@pfoof, very interesting read, thanks for sharing!

I would have hoped the Raspberry Pi fared better in terms of computation per watt, but I guess I just have to try to improve the performance :wink:

I definitely have to look at your thesis in more detail, especially at the detailed benchmarks, result interpretations and comparisons between Raspberry Pi CPU and GPU performance! One thing I can already take away: the result of section 4.4 (Fibonacci adder) suggests that instruction cache misses (or instruction fetching in general) have a far greater performance impact than I thought. Definitely something I should take a look at.