ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Examples benchmarking problem in RaspberryPi 3B+ #599

Closed: ghimiredhikura closed 5 years ago

ghimiredhikura commented 5 years ago

Output of 'strings libarm_compute.so | grep arm_compute_version':

arm_compute_version=v0.0-unreleased Build options: {'arch': 'arm64-v8a', 'debug': '0', 'benchmark': '1', 'benchmark_tests': '1', 'opencl': '0', 'neon': '1', 'cppthreads': '1', 'Werror': '0'} Git hash=b'05e5644715c678773abaf180222a33959ee0dadb'

Platform: Raspberry Pi 3B+

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1400.0000
CPU min MHz:         600.0000
BogoMIPS:            38.40
Flags:               fp asimd evtstrm crc32 cpuid

Operating System: https://github.com/sakaki-/gentoo-on-rpi3-64bit

Problem description: I am using the following build command.

scons arch=arm64-v8a benchmark=1 benchmark_tests=1 opencl=0 neon=1 cppthreads=1 -j3 Werror=0
export LD_LIBRARY_PATH=build/

Benchmarking alexnet

./build/tests/benchmark_graph_alexnet --pretty-file=alexnet.txt --iterations=20 --example_args="--threads=1" --instruments="wall_clock_timer_ms"

Version = arm_compute_version=v0.0-unreleased Build options: {'arch': 'arm64-v8a', 'debug': '0', 'benchmark': '1', 'benchmark_tests': '1', 'opencl': '0', 'neon': '1', 'cppthreads': '1', 'Werror': '0'} Git hash=b'05e5644715c678773abaf180222a33959ee0dadb'
CommandLine = ./build/tests/benchmark_graph_alexnet --pretty-file=alexnet.txt --iterations=20 --example_args=--threads=4 --instruments=wall_clock_timer_ms
Iterations = 20
Running [0] 'Examples/benchmark_graph_alexnet'
Threads : 4
Target : NEON
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Tuner file :
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=173.9102 ms, STDDEV=0.29 %, MIN=173.0490 ms, MAX=175.0760 ms, MEDIAN=174.0070 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 8 second(s)

And here are more benchmark results.

squeezenet_v1_1:      AVG=36.8231 ms, STDDEV=15.15 %, MIN=20.9510 ms, MAX=39.5240 ms, MEDIAN=38.8860 ms
alexnet:              AVG=173.9102 ms, STDDEV=0.29 %, MIN=173.0490 ms, MAX=175.0760 ms, MEDIAN=174.0070 ms
vgg16:                AVG=1107.0216 ms, STDDEV=2.69 %, MIN=1051.7200 ms, MAX=1156.0880 ms, MEDIAN=1110.2560 ms
mobilenet_v2:         AVG=90.0225 ms, STDDEV=13.61 %, MIN=49.3990 ms, MAX=95.8760 ms, MEDIAN=94.8530 ms
resnet50:             AVG=221.2754 ms, STDDEV=7.51 %, MIN=163.8950 ms, MAX=236.7930 ms, MEDIAN=222.1410 ms
googlenet:            AVG=92.1642 ms, STDDEV=0.67 %, MIN=91.3500 ms, MAX=94.0260 ms, MEDIAN=92.1320 ms

The runtime results are way faster than expected and do not seem to be real. Any ideas, please? Thanks.

Deepak

AnthonyBarbier commented 5 years ago

I'm not sure I understand the problem: which ones don't look realistic to you?

ghimiredhikura commented 5 years ago

Hello @AnthonyARM,

I am confused about the run times; they are way too fast. For example, does googlenet really run in 92.16 ms on a Raspberry Pi 3B+?

Using the same settings on Raspbian OS, which is 32-bit (armv7), on the same Raspberry Pi 3B+, I am getting the following benchmark numbers.

squeezenet_v1_1:      AVG=135.3122 ms, STDDEV=0.27 %, MIN=134.6390 ms, MAX=136.2140 ms, MEDIAN=135.2780 ms
alexnet:              AVG=554.4598 ms, STDDEV=9.60 %, MIN=383.4000 ms, MAX=608.2910 ms, MEDIAN=566.2810 ms
vgg16:                AVG=3611.6921 ms, STDDEV=0.60 %, MIN=3595.5750 ms, MAX=3676.1221 ms, MEDIAN=3605.0891 ms
googlenet:            AVG=524.2405 ms, STDDEV=26.09 %, MIN=449.0150 ms, MAX=803.4780 ms, MEDIAN=450.1530 ms
resnet50:             AVG=913.3924 ms, STDDEV=21.45 %, MIN=834.3520 ms, MAX=1517.8990 ms, MEDIAN=836.8410 ms
mobilenet_v2:         AVG=251.6215 ms, STDDEV=25.29 %, MIN=179.2900 ms, MAX=314.9150 ms, MEDIAN=310.1700 ms

Thanks

bonseyes-admin commented 5 years ago

@AnthonyARM Are there any accuracy tests for these models to confirm that the reported benchmark numbers are correct?

GeorgeARM commented 5 years ago

@ghimiredhikura 32-bit runs are slower because we don't have as many optimized kernels as we do for 64-bit, especially GEMM-based kernels.

@bonseyes-admin you can use the graph tests for validating a network; there is an interface for passing a list of images with the expected outputs, and the top-1 and top-5 accuracy will be reported.

bonseyes-admin commented 5 years ago

@GeorgeARM Thanks. Do you have any accuracy tests that you could release? It would save a lot of time when porting and testing the inference engine on different hardware platforms. We could write our own accuracy tests, but it would be much cleaner if you provided an accuracy test for each network, so that developers could easily check that a port to a new platform (device + CPU + OS + compiler, etc.) is working as expected. Even a test for just one of the models would still be useful.

AnthonyBarbier commented 5 years ago

@bonseyes-admin: You don't need to write anything; you can do the validation directly with the graph examples, something like:

LD_LIBRARY_PATH=lib ./bin/graph_alexnet --target=CL --layout=NHWC --type=F32 --threads=4 --validation-range='16666,24998' --validation-file='val.txt' --validation-path='/path/to/test/images/' --data='/path/to/weights/'

Where val.txt is a list of images with expected labels:

val_00000001.JPEG 65
val_00000002.JPEG 970
val_00000003.JPEG 230
val_00000004.JPEG 809
val_00000005.JPEG 516

--validation-range is for when you've got a farm of devices and want to parallelise the runs.
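For example, a minimal sketch of splitting the standard 50,000-image ImageNet validation set across four boards (the chunk arithmetic and the 1-based, inclusive range are my assumptions; adjust to your image count):

DEVICE_ID=0   # 0..3, set differently on each board
CHUNK=12500
START=$(( DEVICE_ID * CHUNK + 1 ))
END=$(( (DEVICE_ID + 1) * CHUNK ))
LD_LIBRARY_PATH=lib ./bin/graph_alexnet --target=CL --layout=NHWC --type=F32 --threads=4 --validation-range="${START},${END}" --validation-file='val.txt' --validation-path='/path/to/test/images/' --data='/path/to/weights/'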

Hope this helps.

bonseyes-admin commented 5 years ago

@AnthonyARM Thanks. Sorry, by accuracy tests I was referring to the models, test images, and expected outputs, since effectively very little software is needed to test the networks. Even if it were for just one model, it would be useful for confirming that a build on a new platform is correct.

You might want to consider releasing your ImageNet models so that developers don't need to try to build up a repository themselves - it would save a lot of time.

psyhtest commented 5 years ago

@AnthonyARM I think the biggest stumbling block for validators would be finding out which weights to use with --data='/path/to/weights/'. For example, where can they download the GoogLeNet weights from?

psyhtest commented 5 years ago

@bonseyes-admin I believe we've got exactly what you need for MobileNets-v1! We adapted the corresponding ArmCL graph example in early 2018 to make a reproducible and reusable artifact for the 1st ACM ReQuEST tournament at ASPLOS'18. The complete Collective Knowledge workflow is available either on GitHub or in the ACM Digital Library.

Furthermore, we have extended this artifact to support TensorFlow (Python), TensorFlow (C++) and TFLite, and contributed these TF extensions to the MLPerf Inference benchmark as the reference MobileNets-v1/v2 code.

Finally, we have provided an interactive dashboard with sample data publicly available for Linaro HiKey960, Firefly RK3399, Samsung Galaxy S8 and Huawei Mate 10 Pro. You can read more about the dashboard here.

Hope this helps. Please do not hesitate to ask if you have any questions or comments.

AnthonyBarbier commented 5 years ago

The Graph API was only added as a stopgap measure while ArmNN was being developed; we're not planning on distributing weights and test images in yet another format when there are already so many existing models out there.

psyhtest commented 5 years ago

@AnthonyARM I agree, and I don't mean that you should. You have enough on your plate with supporting and optimising the library!

I'm just saying that it's indeed possible for the community to build upon your examples and provide complete workflows, as our work on the ReQuEST artifact shows.

Look forward to future collaboration!

bonseyes-admin commented 5 years ago

@AnthonyARM You need to make it easier for your developer community to test and benchmark your inference engine on different platforms, not harder.

Our project (https://www.bonseyes.com/outcomes/) has already contributed a Winograd convolution optimization to ARMCL (a 1.4x speed-up of MobileNetV2), and there are more improvements that we can contribute; however, the testing framework of your library is limiting our ability to do so.

Obviously, releasing your regression tests would help, rather than having the community build their own. I am not referring to building a model zoo (https://github.com/onnx/models); as you point out, nobody needs another zoo. What is missing from your framework is a regression test suite to ensure that a combination of device + OS + drivers + compiler has not introduced an accuracy regression. I think it's unreasonable to expect embedded developers to know the intricacies and nuances of model training and to figure out how to interpret the output of your API. You should be targeting a wider range of developers, including those who don't know the difference in padding between Caffe and TensorFlow models, or why the results will differ given the "same" model architecture. By the way, this is currently a bug in your current release.
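To illustrate the padding point for a single spatial dimension (a minimal sketch with illustrative numbers, not taken from any particular layer):

IN=224 K=3 STRIDE=2
# TensorFlow 'SAME': output size is ceil(IN/STRIDE); any odd total padding
# puts the extra pixel on the right/bottom, i.e. the padding is asymmetric.
OUT_TF=$(( (IN + STRIDE - 1) / STRIDE ))
PAD_TOTAL=$(( (OUT_TF - 1) * STRIDE + K - IN )); (( PAD_TOTAL < 0 )) && PAD_TOTAL=0
PAD_LEFT=$(( PAD_TOTAL / 2 )); PAD_RIGHT=$(( PAD_TOTAL - PAD_LEFT ))
# Caffe: explicit, symmetric padding (here pad=1 on both sides).
OUT_CAFFE=$(( (IN + 2*1 - K) / STRIDE + 1 ))
echo "TF SAME: out=$OUT_TF, pads $PAD_LEFT/$PAD_RIGHT"   # out=112, pads 0/1
echo "Caffe:   out=$OUT_CAFFE, pads 1/1"                 # out=112, pads 1/1

Both conventions give a 112-wide output here, but the input is shifted by one pixel between them, so the "same" weights produce different results.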

Currently, we are blocked by this 64-bit issue for a publication due out this month under our H2020 project: we can't get reliable numbers from your library on ARM64, and as things stand your library is on average 25% slower than the competition (NCNN).

Please consider releasing your internal regression tests, as it would allow you to improve your project faster and accept more community contributions. What we need is:

[image attachment]

Thanks, Tim

GeorgeARM commented 5 years ago

Hello @bonseyes-admin,

What is the problem with the 64-bit runs? What is causing the measurements to be unreliable?

bonseyes-admin commented 5 years ago

@GeorgeARM

The fundamental issue is that, without in-depth debugging and a high degree of knowledge of the neural networks themselves, it's impossible to know whether the outputs of your benchmark program are accurate and reliable on a given platform, OS, compiler, etc., or whether there is a compile or run error on our side in executing the program. We have no way to debug and verify.

For example, compiling the latest master of https://review.mlplatform.org/ml/ComputeLibrary on 64-bit Gentoo Linux with GCC 8.2:

scons arch=arm64-v8a benchmark=1 benchmark_tests=1 opencl=0 neon=1 cppthreads=1 -j3 Werror=0

Running your benchmark program then does not fail; however, the results cannot be verified by the developer:

./build/tests/${name} --pretty-file=benchmark_results/nthreads1/${name}.txt --iterations=20 --example_args="--threads=1" --instruments="wall_clock_timer_ms"
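For context, ${name} above is driven by a loop over the benchmark binaries; a sketch, where the exact list of binary names beyond benchmark_graph_alexnet is illustrative:

mkdir -p benchmark_results/nthreads1
for name in benchmark_graph_alexnet benchmark_graph_googlenet benchmark_graph_resnet50 benchmark_graph_mobilenet_v2; do
    ./build/tests/${name} --pretty-file=benchmark_results/nthreads1/${name}.txt --iterations=20 --example_args="--threads=1" --instruments="wall_clock_timer_ms"
done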

The output being:

[image attachment]

Attached are the JSON logs produced with --instruments=wall_clock_timer_ms,scheduler_timer_ms --json-file=log.txt.

The results for ResNet50 on an A53 at 1400 MHz, according to your program, are:

ARMCL, Gentoo 64-bit OS + GCC 8.2: 1 thread: 141 ms, 2 threads: 160 ms, 4 threads: 200 ms

The results reported to the developer look "correct", yet to me there is clearly a problem: the times get worse as the thread count increases.

The fundamental issue here is not our specific problem; it is that a developer without deep knowledge of how fast a given network should run has no way to determine whether your program is producing reliable and accurate results, or whether they made an error in compiling and running the software.

As a sanity check, we ran the exact same compilation and benchmarking on an RPi3B+ with a 32-bit OS, and the numbers look reasonable.

ARMCL, Raspbian 32-bit OS + GCC 6.3: 1 thread: 2072 ms, 2 threads: 1165 ms, 4 threads: 834 ms

Hence you can see clearly from the log files that the 64-bit version isn't running the entire ResNet50 network for some reason.

However, the fundamental issue remains: how do you ensure the output of your benchmark program is reliable for a developer who doesn't know the details of the networks they are benchmarking? Or is the library's target audience developers who know the details of each benchmarked network and can report issues at the network level?

Hence my issue and proposed solution remain the same: you need to provide accuracy regressions and release your internal regression tests. I'm pretty sure you wouldn't release code to Google unless it passed accuracy regressions on a set of weights and test images; releasing these tests would benefit the library and help the entire developer community make contributions back to the project.

log_resnet50_rpib_64bit.txt

log_resnet50_rpi3b_32bit.txt

AnthonyBarbier commented 5 years ago

I understand what you're asking, and I accept that our graph examples might not be reliable. However, as explained before: for validation of layers in isolation you have our arm_compute_validation test suites; if you want system tests (i.e. entire networks), then I believe these should be provided by the official graph-level API, which in our case is ArmNN / AndroidNN. (It should be able to load and run networks coming from other frameworks, and therefore wouldn't require the creation of a new zoo.)

I'm not saying they currently provide these kinds of suites; I'm just saying I believe that would be a better place for system-level validation.

If it were simple to release the weights and images, we would do it; unfortunately, from a legal point of view it's far from straightforward.

In the meantime George is going to try to reproduce the issue internally and we'll update this thread.

bonseyes-admin commented 5 years ago

@AnthonyARM

Yes, we've looked at the system-level validation approach; however, you quickly get pushed down the path of device- and OS-specific support in the higher-level library, i.e. Android.

It isn't really a maintainable solution for broader usage of ARMCL beyond being a back-end for ArmNN. Then you have a whole discussion on the viability of ArmNN itself, given the explosion of OS APIs in the mobile deep learning stack.

The nice thing about ArmCL is that you can run it anywhere: it provides a good Arm CPU implementation independent of the OS (you only need a standard Arm core), and you don't care how the model is loaded and dispatched. Then there is the issue of reporting bugs: is a given issue in ArmNN or in ARMCL? It depends on the developer you ask.

I can see the legal constraints, however. Maybe you could publish a guide on how to use the public API for testing the accuracy of ARMCL, explaining how a developer would generate those files and what the expected Top-1 and Top-5 accuracy of one "example" ImageNet model should be. From that, I think things can be figured out even if you can't get the assets past legal.
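For instance, pairing the sorted validation image names with the ILSVRC ground-truth labels would produce a val.txt in the format shown earlier (a rough sketch; the ground-truth file name is assumed, and note that frameworks disagree on class-index ordering, so the labels may need remapping):

paste -d' ' <(ls /path/to/test/images/ | sort) ILSVRC2012_validation_ground_truth.txt > val.txt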

We can provide an example Caffe model for ImageNet if you like.

Thanks.

psyhtest commented 5 years ago

@bonseyes-admin

Are you specifically after an example of using Caffe weights? The MobileNets implementation I shared above loads weights converted from TensorFlow. It's allowed us, for example, to detect some discrepancies between ArmCL and TensorFlow.

GeorgeARM commented 5 years ago

Hello @bonseyes-admin,

I had a look at the issue and I can reproduce it. I managed to work around what is most likely a GCC issue, and the results now look sensible. I will notify you once a patch is uploaded to the public server for review. Hope this helps.

GeorgeARM commented 5 years ago

Hello @bonseyes-admin, I have created this patch: https://review.mlplatform.org/#/c/ml/ComputeLibrary/+/390/ Can you check whether it solves the reliability issues you have been seeing?

bonseyes-admin commented 5 years ago

Hi @GeorgeARM Thanks we will check the patch and let you know.

bonseyes-admin commented 5 years ago

Hi @GeorgeARM

Thanks. We have applied the patch and the results look much more reasonable. We will look at creating an accuracy regression test to confirm that the issue is resolved.

[image attachment]

log_resnet50_patch390.txt

Thanks, Tim

GeorgeARM commented 5 years ago

Closing the issue as the problem has been resolved. Reopen if needed.