Closed samashu007 closed 5 years ago
Hard to tell, how is your platform configured ? (Which raspberry pi is it ? What CPUs are enabled ? At what frequency ?)
Also Raspbian might be using soft-float instead of hard-float as it's compiled for armv6 instead of armv7.
Finally, you might want to call conv0->run() once before you start timing (some weights reshaping happens during the first run(), which makes it a lot slower than the subsequent calls).
Attaching the outputs of the following commands:
Also, the minimum and maximum clock frequencies are 600 MHz and 1200 MHz.
The arm-compute library was built with arch=armv7a.
Also, after executing conv->run() once before timing, the results have improved a little. The original implementation (pure C) takes 11.2 ms for 100 iterations, while the ARM NEON implementation takes 30.58 ms for 100 iterations. So the results have improved, but the NEON version being slower than plain C still shouldn't be happening.
For a 3x3x3x3 convolution on a 50x50x3 image, I am getting a reduction in time by a factor of 2: original code, 1.001 s (100 iterations); NEON implementation, 0.510 s (100 iterations). However, the time is supposed to be reduced by a factor of 6 or 7 with the NEON implementation. Kindly suggest further steps.
Any suggestions? Please?
I modified the configure() function and forced it to choose the direct convolution method instead of GEMM. The results have drastically improved. For a 3x3x3x3 convolution on a 50x50x3 image: 0.043 seconds (NEON implementation) vs. 1.083 seconds (original C testbench code).
Is such a drastic improvement of 25x correct?
Hi @samashu007
I believe your input is too small to see the drastic improvement in performance you expect.
I think you'll see bigger improvements if you increase the number of channels to 256 for example and set the convolution method to GEMM.
Hope this helps.
That worked. Thanks. :) There is another issue. Running the same conv->run() function, say, some 10 times gives different results. The difference is not that significant in some runs, maybe a few microseconds, but in others it changes by a factor of 2. Is this a bug? What could be the possible reasons?
Hi @samashu007
Hope this helps.
On the remote machine, through which I am cross-compiling, there are other processes running, but not on the Raspberry Pi itself. Also I built ComputeLibrary using 'scons Werror=1 debug=0 asserts=0 neon=1 opencl=0 examples=1 build=native -j2'. Should I remove the '-j2' part to build it on a single thread?
Please reply?
@samashu007
Building with -j2 is fine. I was referring to the number of threads the function uses at run-time when you execute the example. You can try experimenting with different numbers of threads when running the example and see what effect this has.
Also consider that the first run will be slower than the next ones as there is some extra work to do reshaping matrices.
Hope this helps.
@samashu007 as @morgolock suggested, check that your CPUs run at a fixed frequency and that your governor is set to, for example, performance. Otherwise your frequency might scale for a variety of reasons. You can find instructions online for doing so.
I tried doing the same. The CPU runs at the max frequency of 1.2 GHz, with GOVERNOR="performance". There is no improvement; the results still vary on every execution.
Hi @samashu007,
have you tried measuring the execution time of the graph examples? For instance, what performance do you get running squeezenet or alexnet?
Thanks
Hi @samashu007
Could you please try to measure the time taken by each call to conv0->run()? If there are big differences, it's likely due to one of the reasons mentioned above: system load, power saving, thread scheduling policy.
The shapes in your test are too small to see big performance gains in ACL's NEON code; I'd suggest you increase the number of channels of the input tensor.
I'll close this issue. Please create a new one for performance discussion if you have doubts.
Output of `strings libarm_compute.so | grep arm_compute_version`:
arm_compute_version=v18.03 Build options: {'arch': 'armv7a', 'opencl': '0', 'neon': '1', 'examples': '1', 'asserts': '0', 'debug': '0', 'os': 'linux', 'Werror': '1'} Git hash=02c62c8030e7aca592b294396556a93c6bfb9f7a
Platform: Raspberry Pi Operating System: Raspbian OS
Problem description: I am testing the performance of ARM NEON for a 2x2x3x3 convolution kernel on a 5x5x3 image. I modified the neon_cnn.cpp file, setting the parameters as desired. I also deleted the functions for conv1, act1, fc0 and softmax, as I was only interested in a simple 2x2 convolution on a 5x5x3 image.

I am profiling the modified function for time measurement. Specifically, I have profiled across the conv0->run() function, as it performs only the required convolution (the additions and multiplications). The conv0->run() function runs in a loop (number of iterations = 100) for testing purposes.

The measured time comes out to 23.237 ms. Isn't that too much? The same 2x2 convolution on a 5x5 image runs much faster without NEON functions, i.e. as an independent C program. What are the reasons for such discrepancies in the results?
Here is the code (modified neon_cnn.cpp) which implements the convolution on ARM NEON: