dividiti / ck-caffe

Collective Knowledge workflow for Caffe to automate installation across diverse platforms and to collaboratively evaluate and optimize Caffe-based workloads across diverse hardware, software and data sets (compilers, libraries, tools, models, inputs):
http://cKnowledge.org
BSD 3-Clause "New" or "Revised" License

ARM GPU/OpenCL optimization #105

Closed: kindloaf closed this issue 7 years ago

kindloaf commented 7 years ago

Hi, I'm reading the current crowd-sourced results, and the GFLOPS they imply is much lower than the advertised GFLOPS of Mali GPUs. According to Wikipedia, the Mali-T880 delivers about 20 GFLOPS per core. But in the test results, many entries report that it takes ~1 second to run SqueezeNet (about 1 GFLOP of compute) and a few seconds to run GoogLeNet (about 3 GFLOP). This means Caffe is running at about 1 GFLOPS during inference.
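
A quick back-of-the-envelope check of that arithmetic (the FLOP counts, timings and peak figure below are the rough numbers quoted in this thread, not measurements):

```python
# Effective throughput implied by the reported timings.
# FLOP counts per inference are approximate and depend on the exact model variant.
models = {
    "squeezenet": {"gflop_per_inference": 1.0, "seconds_per_inference": 1.0},
    "googlenet":  {"gflop_per_inference": 3.0, "seconds_per_inference": 3.0},
}

peak_gflops = 20.0  # roughly, per Mali-T880 core (public figures vary with clock speed)

for name, m in models.items():
    effective = m["gflop_per_inference"] / m["seconds_per_inference"]
    utilization = 100 * effective / peak_gflops
    print(f"{name}: ~{effective:.1f} GFLOPS effective, "
          f"~{utilization:.0f}% of a single core's ~{peak_gflops:.0f} GFLOPS peak")
```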

Any suggestions on how to explain the gap, or how to close it?

gfursin commented 7 years ago

Yes, we've noticed that too. The current Caffe OpenCL version is not yet well optimized, and most of its kernel optimization parameters seem to be tuned for NVIDIA GPUs (there is also an ongoing community effort to improve Caffe for Intel and AMD GPUs, but very little for Mali).

Note that the first stage of our CK-Caffe project was to provide a workflow framework that simplifies customizing Caffe builds with different compilers and libraries across different platforms (to some extent what Caffe2 is trying to achieve, but as a generic CK-based workflow and without changing APIs), and to share open statistics so that the community can reproduce and improve performance. We are more or less done with this stage and recently started evaluating and optimizing various OpenCL libraries.
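
For readers unfamiliar with CK, here is a rough sketch of driving such a workflow from Python via the CK kernel API (`ck.kernel.access`). The package and program names below are illustrative only and may not match the actual CK-Caffe entries:

```python
# Sketch only: package/program names are illustrative, not the exact CK-Caffe entries.
import ck.kernel as ck

# Install a Caffe library variant (e.g. an OpenCL build) for the current platform.
r = ck.access({'action': 'install',
               'module_uoa': 'package',
               'data_uoa': 'lib-caffe-bvlc-opencl-clblast-universal'})
if r['return'] > 0:
    raise RuntimeError(r.get('error', 'CK error'))

# Run the Caffe benchmarking program; CK resolves the matching library/model dependencies.
r = ck.access({'action': 'run',
               'module_uoa': 'program',
               'data_uoa': 'caffe',
               'cmd_key': 'time_gpu'})
if r['return'] > 0:
    raise RuntimeError(r.get('error', 'CK error'))
```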

For example, a couple of days ago we were auto-tuning the CLBlast library on Mali GPUs (various parameters including local work sizes, etc., for different data sets) and already see around 3x speedups. However, the problem is that we need different parameters for different matrix sizes, so we need to provide dynamic adaptation, which is not there yet... We added CLBlast to our CK crowd-tuning workflow and will soon start sharing optimization results at cKnowledge.org/repo (and will release the open-source code when stable).
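
The "dynamic adaptation" mentioned above essentially means selecting tuned kernel parameters at run time based on the problem shape instead of hard-coding one configuration. A minimal sketch of that idea (the parameter values and size thresholds are invented for illustration, not real CLBlast tuning results):

```python
# Illustrative only: parameter values and size buckets are made up, not real tuning output.

# Tuned GEMM parameter sets, keyed by a coarse problem-size bucket.
TUNED_PARAMS = {
    "small":  {"local_work_size": (8, 8),   "tile_m": 16, "tile_n": 16},
    "medium": {"local_work_size": (16, 8),  "tile_m": 32, "tile_n": 32},
    "large":  {"local_work_size": (16, 16), "tile_m": 64, "tile_n": 64},
}

def select_params(m, n, k):
    """Pick a tuned parameter set based on the GEMM dimensions (m, n, k)."""
    size = m * n * k
    if size < 1 << 18:
        return TUNED_PARAMS["small"]
    elif size < 1 << 24:
        return TUNED_PARAMS["medium"]
    return TUNED_PARAMS["large"]

# Example: a convolution lowered to GEMM would query this once per layer shape.
print(select_params(256, 256, 1152))
```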

We also plan to evaluate and optimize libDNN and to add the new ARM Compute Library. We just have to focus on several other projects during the coming months and have very limited spare resources to do extra tuning on specific devices.

Hope this helps...

kindloaf commented 7 years ago

@gfursin Thank you for the detailed information. It's very helpful to see the overall plan!