Idein / qmkl6

BLAS library for VideoCore VI QPU (Raspberry Pi 4)
BSD 3-Clause "New" or "Revised" License

Backport for videocore 4 #13

Open marcusnchow opened 2 years ago

marcusnchow commented 2 years ago

Hello, I am a Ph.D. student at UC Riverside doing research in ML on edge devices. I noticed that your qmkl6 has a full BLAS library, while qmkl for the VideoCore 4 only has sgemm. Do you plan on making a backport to support VideoCore 4 devices? If not, what would it take to do such a thing? I know that the ISA is different, but would it be possible to make a compatible version? I am very much interested in your work and would love to learn more! Thanks, Marcus

Terminus-IMRC commented 2 years ago

Thank you for your interest! Currently, I've been focusing on VC6 simply because VC6 is faster than VC4. In addition, we, Idein Inc., have a private ML library that computes NN layers directly, not through sgemm or other BLAS functions. So, as a company, we don't need all of the BLAS functions to be implemented.

Yes, the ISA is different, and we cannot directly port VC6 code to VC4. However, if you have specific requests, I'll gladly implement them for VC4.

marcusnchow commented 2 years ago

Ideally, we would need a full BLAS suite for VC4, as we use some non-ML workloads as well in our research. I'm currently interested in getting it running on the Pi Zero, since that is the cheapest Raspberry Pi, and I've been working on a performance/cost analysis of various embedded systems.

Does your company plan on open-sourcing its ML library at all? Our lab, SOCAL, would be interested in collaborating.

Terminus-IMRC commented 2 years ago

There are so many BLAS functions, and some of them (e.g. strided, diagonal, or sparse matrix operations) are not feasible on VC4, because the VC4 QPU writes to memory in units of 64 bytes. So could you tell me which workloads you are measuring?
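To illustrate what that 64-byte granularity means in practice, here is a rough sketch (the helper below is hypothetical, not part of QMKL): 64 bytes is 16 floats, so an output matrix is easiest to handle when each row is padded to a whole number of 16-float chunks, which is why strided or sparse layouts do not map well onto the QPU.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: round a row length up so
 * that every row of an output matrix spans a whole number of 64-byte
 * (16-float) chunks, matching the VC4 QPU's store granularity. */
static size_t round_up_row_floats(size_t n_cols)
{
    const size_t chunk = 64 / sizeof(float); /* 16 floats per QPU store */
    return (n_cols + chunk - 1) / chunk * chunk;
}

/* e.g. a 100-column result would be allocated with a leading dimension of
 * round_up_row_floats(100) == 112, and the 12-float tail of each row is
 * padding the QPU is free to overwrite. */
```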

I heard we're not planning to, but if you're interested, could you tell us the details of your research through https://idein.jp/en/contact/ ?

marcusnchow commented 2 years ago

Yeah, I know :( I think most of our workloads are matrix-matrix operations, so level-3 BLAS, but I'll have to double-check everything. Right now, we are trying to get PyTorch running with Arm's NEON SIMD units, but we wanted to compare against using the QPU.

Also, I've sent an email through your contact page, so we can discuss further there if you like.

Terminus-IMRC commented 2 years ago

Thank you for telling us the details of your projects. Unfortunately, we have decided not to publish our ML code, because it is one of the most pioneering parts of our products.

Instead, could your project be done using Actcast? It offers MobileNet v3 inference on the QPU free of charge for demo purposes. Actcast can take actions when specified conditions are met (a mechanism called an Act), which I think makes it a good candidate for a sensor network.

For the CPU, yes, you can use PyTorch or TensorFlow Lite for ML, and OpenBLAS is the fastest among the BLAS libraries IIRC. If you need more BLAS functions for the QPU, or if you have any questions, please let me know.

marcusnchow commented 2 years ago

I think we could use Actcast, at least for the benchmarking portion of our project. Does your MobileNet work on the Pi Zero? I'll start to play around with it.

Regarding the BLAS for VC4, if it's not too much to ask, would it be possible for you to develop axpy, gemm, and gemv as examples? Then I could take over and develop the others we might need. I am familiar with GPU programming, but not with how the QPU driver works, so an example would help a lot.

Terminus-IMRC commented 2 years ago

Yes, the app runs on all models of Raspberry Pis.

I see. QMKL, the BLAS for VC4, already includes gemm, so I'm going to develop axpy and gemv. QMKL and QMKL6 are made compatible with the other ordinary BLAS libraries (especially Intel MKL, which is freely available). In addition, QMKL and QMKL6 both have example code under the "test" directory, so please consult it for usage.
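For a concrete picture of the MKL-compatible interface, here is a minimal sketch of what an axpy call could look like once implemented, assuming the MKL-style mkl_malloc/mkl_free allocator and the standard CBLAS saxpy signature (the header name is an assumption; the tests under "test" show the actual include and build flags):

```c
#include <mkl.h> /* assumed MKL-compatible header; see the test directory */

int main(void)
{
    const int n = 1024;
    /* mkl_malloc/mkl_free mirror Intel MKL's aligned allocator; a
     * page-sized alignment keeps the buffers friendly to QPU DMA. */
    float *x = mkl_malloc(n * sizeof(float), 4096);
    float *y = mkl_malloc(n * sizeof(float), 4096);
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* y := 3*x + y, i.e. the saxpy requested above, through the
     * standard CBLAS signature that QMKL/QMKL6 follow. */
    cblas_saxpy(n, 3.0f, x, 1, y, 1);

    mkl_free(y);
    mkl_free(x);
    return 0;
}
```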