Idein / py-videocore6

Python library for GPGPU programming on Raspberry Pi 4
https://idein.jp
GNU General Public License v2.0

[Suggestion] Create some tutorials on how to use py-videocore6 #43

Open filiphhh opened 4 years ago

filiphhh commented 4 years ago

Hi!

I have been following py-videocore6 since you first made it public, and py-videocore before that. I'd like to thank you for simplifying QPU programming on the Raspberry Pi. I have been trying to learn how to make better use of the QPUs, but it is difficult to find resources on the subject. I would love to see some tutorials on how to write applications with py-videocore6 and on how to parallelize programs to fully utilize VideoCore's potential.

Thank you @Terminus-IMRC, @notogawa and Idein for all your hard work!

Terminus-IMRC commented 4 years ago

Thank you for supporting us! We also think we need some tutorials for beginners, but currently, no tutorials exist except for the Japanese one: https://qiita.com/9_ties/items/15ab7fa198991a61a3a9

Because the instruction set of the VC6 QPU is very similar to that of the VC4 QPU, you can learn the basics of how a QPU works (add/mul ALU dual issue, three branch delay slots, the TMU, etc.) from the VideoCore IV 3D Architecture Reference Guide.

Though there is no publicly available documentation for the VC6 QPU, you can gain an understanding of it from working examples. I added some simple example programs to this repository, which may help you when you write VC6 QPU code:

These examples can run on up to eight QPUs, and they show how to assign a region of the input/output memory to each QPU.
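As a rough illustration of that memory assignment, here is a minimal host-side sketch in plain Python. It only computes which slice of a buffer each QPU would work on; `partition`, the base address, and the element size are hypothetical names and values chosen for this sketch, not part of py-videocore6, and the repository examples may lay out memory differently.

```python
def partition(n_elements, n_qpus=8, elem_size=4, base_addr=0):
    """Split n_elements evenly across n_qpus.

    Returns one (start_address, element_count) pair per QPU.
    Hypothetical helper for illustration; not a py-videocore6 function.
    """
    if n_elements % n_qpus != 0:
        raise ValueError("n_elements must be a multiple of n_qpus")
    per_qpu = n_elements // n_qpus
    # QPU q starts q * per_qpu elements (times the element size) past the base.
    return [(base_addr + q * per_qpu * elem_size, per_qpu)
            for q in range(n_qpus)]


# Example: 1024 four-byte elements, buffer assumed to start at 0x1000.
regions = partition(1024, n_qpus=8, elem_size=4, base_addr=0x1000)
for qpu_index, (addr, count) in enumerate(regions):
    print(f"QPU {qpu_index}: start=0x{addr:x}, elements={count}")
```

In a real kernel, each QPU would typically receive its own start address and element count as parameters, so that the same code can run on every QPU.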

Also, notogawa added a matrix-matrix multiplication example, examples/sgemm.py. Its innermost loop uses in-QPU vector rotation to reduce the number of memory loads and stores.
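To give a feel for why rotation saves memory traffic, the following plain-Python sketch emulates a 16-lane vector and forms all 256 products of a 16x16 outer product from just two loaded vectors. This is a generic illustration of the idea, not a description of how examples/sgemm.py is organized; `rotate` and `outer_product_by_rotation` are hypothetical names, and the rotation direction is an assumption of this sketch.

```python
LANES = 16  # a QPU vector holds 16 elements


def rotate(vec, r):
    """Emulated rotation: lane i receives vec[(i + r) % LANES]."""
    return [vec[(i + r) % LANES] for i in range(LANES)]


def outer_product_by_rotation(a, b):
    """Compute c[i][j] = a[i] * b[j] using one copy of a and one copy of b.

    Each of the 16 steps rotates b in registers instead of reloading it,
    so lane i sees every element of b exactly once.
    """
    c = [[0.0] * LANES for _ in range(LANES)]
    for r in range(LANES):
        rotated_b = rotate(b, r)
        # On real hardware the 16 lanes would do this in one vector multiply.
        for i in range(LANES):
            c[i][(i + r) % LANES] = a[i] * rotated_b[i]
    return c


a = [float(i + 1) for i in range(LANES)]
b = [float(2 * i) for i in range(LANES)]
c = outer_product_by_rotation(a, b)
```

Each step fills one wrapped diagonal of the result, so after 16 rotations every pairing of an element of `a` with an element of `b` has been computed without touching memory again.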

In conclusion, I recommend that you start by writing a primitive program (a simple memory read/write, or array addition/subtraction/multiplication), referring to the examples. Then you can consider how to approach the theoretical peak performance of 32 Gflop/s (by making good use of the register files and the TMU/L2 caches).
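For reference, the 32 Gflop/s figure can be reproduced with a short calculation. The hardware parameters below (500 MHz clock, 2 slices of 4 QPUs, 4 physical lanes per QPU, and one add plus one mul per lane per cycle) are assumptions of this sketch based on the commonly cited Raspberry Pi 4 configuration; they are not stated in this thread.

```python
# Assumed VideoCore VI parameters on Raspberry Pi 4 (not taken from this thread).
CLOCK_HZ = 500e6             # QPU clock
SLICES = 2                   # slices in the GPU
QPUS_PER_SLICE = 4           # 2 x 4 = 8 QPUs in total
PHYSICAL_LANES_PER_QPU = 4   # 16-element vectors are processed 4 lanes at a time
OPS_PER_LANE_PER_CYCLE = 2   # add ALU and mul ALU issue together

peak_flops = (CLOCK_HZ * SLICES * QPUS_PER_SLICE
              * PHYSICAL_LANES_PER_QPU * OPS_PER_LANE_PER_CYCLE)
print(f"Theoretical peak: {peak_flops / 1e9:.0f} Gflop/s")
```

Reaching this figure requires both ALUs to do useful floating-point work on every cycle, which is why keeping data in the register files and caches matters so much.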

Terminus-IMRC commented 4 years ago

We've just released more VC6 QPU examples: https://github.com/Idein/qmkl6

filiphhh commented 4 years ago

Thank you for your quick replies! I would never have found the Japanese tutorial if you hadn't shared the link to it. Google Translate seems to do a pretty good job of translating it. Great to see QMKL for rpi4 too!

I hope that your libraries will get some more traction in the community, as they unlock a lot more power in these little devices. Thank you for all the low-level rpi resources you have produced!