ctuning / ck-tensorflow

Collective Knowledge components for TensorFlow (code, data sets, models, packages, workflows):
http://cKnowledge.org
BSD 3-Clause "New" or "Revised" License

TF-Lite GPU benchmark results? #91

Open mrgloom opened 6 years ago

mrgloom commented 6 years ago

Are any TF-Lite GPU benchmark results for mobile phones available?

psyhtest commented 6 years ago

As far as I know, TFLite only provides GPU acceleration via AndroidNN, which is available from Android 8.1. Unfortunately, the latest phones we have only support Android 8.0. If someone has a newer phone, we can provide instructions on how to benchmark TFLite there (specifically, MobileNets we are contributing to MLPerf).

mrgloom commented 6 years ago

Thanks for the clarification.

For example, selecting ARM Mali-T830 in the GPU dropdown shows me benchmarks that are all on CPU and OpenCL (as far as I can see in the Crowd scenario column). Is that a lack of data, or does none of the DNN frameworks support GPU on Android? http://cknowledge.org/repo/web.php?template=cknowledge&action=index&module_uoa=wfe&native_action=show&native_module_uoa=program.optimization

I have also found this AI benchmark for Android smartphones: http://ai-benchmark.com/ranking.html#ranking But information about the GPU and DNN framework is not available (maybe we can loosely assume that Android >= 8.1 uses the GPU).

gfursin commented 6 years ago

Hi @mrgloom .

If I am correct, we had time to add 2 scenarios with GPU: Caffe (OpenCL version) and ArmCL: https://github.com/ctuning/ck-crowd-scenarios/tree/master/experiment.scenario.mobile . Note that our OpenCL versions work exclusively on GPU (I believe that we force it in scenarios - @psyhtest, can you please confirm?), so if you see OpenCL, you can assume that this scenario ran on GPU.

I also guess that there is just a lack of data if you don't see many GPU points - this Android app was run by volunteers but we are not advertising it too much now. It was a proof-of-concept project and we are now trying to build a more user-friendly way of adding scenarios on top of our low-level CK plugins.

However, maybe you can try to run it on your newer mobile and see if these GPU scenarios are still working (Caffe OpenCL and ArmCL). You can get the Android app here: http://cknowledge.org/android-apps.html . Please tell us whether it works - I will be curious to see the results!

Thank you very much for your feedback!

mrgloom commented 6 years ago

I have successfully run the app on a smartphone with Android 8.0.0.

Here is the list with comments:

  1. ArmCL 18.05 OpenCL: MobileNets v1 0.25 128 (it seems strange that it has a size of 141 MB)
  2. Caffe CPU v2 SqueezeNet 1.1 (36 MB, but in my experiments with Caffe, SqueezeNet v1.1 should be 2.9 MB)
  3. Caffe OpenCL: SqueezeNet 1.1
  4. TFlite CPU: MobileNets v1 0.25 128

In my benchmarks, TFLite CPU is faster than ArmCL (for MobileNets v1 0.25 128) and Caffe CPU is faster than Caffe OpenCL (for SqueezeNet 1.1): http://cknowledge.org/repo/web.php?template=cknowledge&action=index&module_uoa=wfe&native_action=show&native_module_uoa=program.optimization Also, the problem is that the frameworks don't share at least one common model, so I can't compare them directly.

psyhtest commented 6 years ago

Also, the problem is that the frameworks don't share at least one common model, so I can't compare them directly.

Now you can! Please take a look at our brand new dashboard functionality for the MobileNets implementations (which we are contributing to MLPerf Inference): http://cknowledge.org/dashboard

The default workflow "MobileNets (highlights)" currently shows MobileNets v1/v2 with TFLite 0.1.7 on Firefly RK3399 and Linaro HiKey960, as well as best points for MobileNets v1 with Arm Compute Library v18.08 on HiKey960 (which can serve as a vendor submission example).

By default, the X dimension shows the minimum execution time per image, while the Y dimension shows the Top-1 accuracy. To the right of the workflow name is an icon to invoke additional settings where you can filter out and customise pretty much everything! For example, the Color dimension shows "Image classification rate (maximum, per second)" by default. The fastest point (MobileNets v1-0.25-128, TFLite, HiKey960) is red as it peaks at 161 images per second. If you change the Color dimension to "Image classification efficiency (maximum, per second per Watt)", you will see three red points at 17-18 images per second per Watt. Interestingly, RK3399 is a bit more efficient than HiKey960 here (at least, with the peak power values that I plucked from thin air for each platform).
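The relationship between these dimensions is simple arithmetic; a minimal sketch (the timing and power numbers below are made up for illustration, not taken from the dashboard):

```python
# Derive the dashboard-style metrics from raw measurements.
min_time_per_image_s = 0.00621   # minimum execution time per image (X dimension)
peak_power_w = 9.0               # assumed peak platform power (hypothetical)

# Image classification rate (maximum, per second) -- the default Color dimension.
rate_per_s = 1.0 / min_time_per_image_s

# Image classification efficiency (maximum, per second per Watt).
efficiency_per_s_per_w = rate_per_s / peak_power_w

print(round(rate_per_s), round(efficiency_per_s_per_w))  # → 161 18
```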

The workflow "MobileNets (all)" (select from the dropdown menu) includes all ArmCL points exploring the available options for the convolution method, data layout and kernel tuner choices. You can discern these options on the plot thanks to the Marker overlay dimension. In the default workflow, you can only see the convolution method. Conveniently, dots over polygons mark GPU points, which are faster than the corresponding CPU points except for the least accurate models.

Have fun!

... and please let us know if you have any questions or suggestions.

psyhtest commented 6 years ago

ArmCL 18.05 OpenCL: MobileNets v1 0.25 128 (it seems strange that it has a size of 141 MB)

The model itself is only ~2 MB but we bundle together the engine (i.e. the library and the client program). I suspect we include a debug build as we had issues on Android:

For some reason only debug version of the library can be used with this program on Android. When we use release version, the program gets stuck at stage "Preparing ArmCL graph".

The good news is that the same engine is reused across all ArmCL OpenCL MobileNets samples. This means that if you add any other such sample model, you will only need to download a few MB of extra weights.

/cc @Chunosov

psyhtest commented 6 years ago

In my benchmarks, TFLite CPU is faster than ArmCL (for MobileNets v1 0.25 128) and Caffe CPU is faster than Caffe OpenCL (for SqueezeNet 1.1)

That's expected for very small models. There's simply not enough work to keep the GPU busy, and CPU caching works well. However, if you look at the MobileNets highlights, most GPU points (with dots) lie on the Pareto-optimal frontier: for any such point, to improve speed (move left), you need to lose accuracy (move down); similarly, to improve accuracy (move up), you need to lose speed (move right).
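The Pareto-optimal frontier can be computed mechanically from the raw points; a rough sketch (the latency/accuracy numbers below are hypothetical, not dashboard data):

```python
# Each point is (execution time in ms, Top-1 accuracy in %); lower time and
# higher accuracy are better. A point is Pareto-optimal if no other point is
# at least as fast AND at least as accurate.
def pareto_frontier(points):
    frontier = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier

# Hypothetical (latency, accuracy) points:
points = [(6.2, 41.5), (10.0, 50.0), (12.0, 48.0), (30.0, 70.9)]
# (12.0, 48.0) is dominated by (10.0, 50.0): slower AND less accurate.
print(pareto_frontier(points))  # → [(6.2, 41.5), (10.0, 50.0), (30.0, 70.9)]
```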

mrgloom commented 6 years ago

It seems that the Firefly RK3399 and Linaro HiKey960 are not real consumer phones.

It also seems that Google has benchmark results for a single phone (Pixel 1) for MobileNet variants and ShuffleNet: https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet Here is also some comparison between the models: https://www.tensorflow.org/lite/performance/best_practices Also, TFLite seems to have its own benchmark tool: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark

psyhtest commented 6 years ago

While HiKey960 is a development board, it has the same chip (HiSilicon Kirin 960) that Huawei used in several of their popular phones (including the Mate 9 Pro and P10). I have results from a real Mate 10 Pro too.

The graph in that repo is from the original MobileNets v2 paper, but it's very crude: you can only guess which model is shown and estimate its performance (e.g. ±1 ms) and accuracy (e.g. ±1%). Besides, it's very hard to reproduce: it took us several weeks to understand how to load the weights, how to preprocess the inputs and how to interpret the outputs. But now anyone can run experiments across many platforms, under different conditions, try different datasets and so on.

You would be very welcome to contribute your experimental data to the dashboard.

psyhtest commented 6 years ago

I've added TFLite results on the Huawei Mate 10 Pro (HiSilicon Kirin 970) and Samsung Galaxy S8 US (Qualcomm Snapdragon 835). You may want to filter the results by Library=tflite-0.1.7, Version=1 and set the Color dimension to Platform. If you then look at individual models (e.g. v1-1.00-224), you will see some general trends.

Note, however, that the Linux devices (HiKey960 and RK3399) had the CPU frequencies set to the maximum, while the Android devices (Mate 10 Pro and Galaxy S8 US) were non-rooted, so the CPU frequencies were managed automatically.

mrgloom commented 6 years ago

Looks good, but it would be great if one could share a link to the current 'view' of the dashboard. Something like http://cknowledge.org/dashboard/mlperf.mobilenets&library=tflite-0.1.7&model=v1-1.00-128

Also, is peak memory usage stored somewhere in the benchmark logs?

Are the .tflite models available for direct download? I want to test them locally with https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark

Update: It looks like TensorFlow also has a tool to measure accuracy: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/accuracy/README.md

psyhtest commented 5 years ago

it would be great if one could share a link to the current 'view' of the dashboard

Thanks for your feedback! Yes, supporting links with settings is on our roadmap.

Also, is peak memory usage stored somewhere in the benchmark logs?

Not at the moment. Storing it would be easy, but we need to know how to measure it reliably. Do you have any suggestions?
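For what it's worth, one common approach on Linux is to read the process's peak resident set size from the kernel; a minimal sketch (this only captures the host process's memory high-water mark, not separate GPU allocations):

```python
import resource

def peak_rss_kb():
    # Peak resident set size of this process, as reported by the kernel.
    # On Linux, ru_maxrss is in kilobytes (on macOS it is in bytes).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
buf = bytearray(50 * 1024 * 1024)  # allocate ~50 MB to bump the peak
after = peak_rss_kb()
print(before, after)  # the peak never decreases, so after >= before
```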

Are the .tflite models available for direct download?

Of course, the links are provided in the MobileNets-v1 and MobileNets-v2 README files, so you can download them directly e.g.:

anton@diviniti:/tmp$ wget https://storage.googleapis.com/mobilenet_v2/checkpoints/mobilenet_v2_0.35_96.tgz
--2018-12-03 12:04:40--  https://storage.googleapis.com/mobilenet_v2/checkpoints/mobilenet_v2_0.35_96.tgz
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.201.16, 2a00:1450:400c:c06::80
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.201.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37815375 (36M) [application/x-tar]
Saving to: ‘mobilenet_v2_0.35_96.tgz’

mobilenet_v2_0.35_96.tgz                           100%[================================================================================================================>]  36.06M  18.3MB/s    in 2.0s

2018-12-03 12:04:42 (18.3 MB/s) - ‘mobilenet_v2_0.35_96.tgz’ saved [37815375/37815375]

anton@diviniti:/tmp$ tar xvzf mobilenet_v2_0.35_96.tgz
./
./mobilenet_v2_0.35_96_info.txt
./mobilenet_v2_0.35_96_frozen.pb
./mobilenet_v2_0.35_96_eval.pbtxt
./mobilenet_v2_0.35_96.ckpt.data-00000-of-00001
./mobilenet_v2_0.35_96.ckpt.index
./mobilenet_v2_0.35_96.tflite
./mobilenet_v2_0.35_96.ckpt.meta

As I explained above, however, you then need to perform many manual steps (which CK does behind the scenes).

Also note that the TFLite Model Benchmarking Tool uses random data, so it cannot be used to measure accuracy.
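Measuring accuracy requires labelled inputs; the Top-1 metric itself is just an argmax comparison. A toy sketch with made-up model outputs (not real TFLite results):

```python
# Top-1 accuracy: the fraction of samples whose highest-scoring class
# matches the ground-truth label.
def top1_accuracy(outputs, labels):
    correct = sum(
        1 for scores, label in zip(outputs, labels)
        if scores.index(max(scores)) == label
    )
    return correct / len(labels)

outputs = [[0.1, 0.7, 0.2],   # predicts class 1
           [0.6, 0.3, 0.1],   # predicts class 0
           [0.2, 0.2, 0.6]]   # predicts class 2
labels = [1, 0, 1]            # hypothetical ground truth
print(top1_accuracy(outputs, labels))  # → 0.6666666666666666
```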

mrgloom commented 5 years ago

Also a question: are the tflite models benchmarked in single-threaded mode?

psyhtest commented 5 years ago

are the tflite models benchmarked in single-threaded mode?

In the default mode, which happens to be multithreaded.

By the way, I think part of the variation in the results is due to thread migration between big and LITTLE cores. We are planning to set up thread affinity to reduce the variation.
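On Linux, pinning a process to a fixed set of cores can be done through the scheduler affinity API; a minimal sketch (the numbering of the big cores, e.g. 4-7, varies by SoC and is only an assumption here):

```python
import os

# Query the CPUs the current process is currently allowed to run on.
allowed = os.sched_getaffinity(0)
print("initially allowed:", sorted(allowed))

# Pin to a hypothetical "big" cluster (cores 4-7 on many big.LITTLE chips);
# fall back to whatever CPUs actually exist on this machine.
big_cores = ({4, 5, 6, 7} & allowed) or allowed
os.sched_setaffinity(0, big_cores)
print("now pinned to:", sorted(os.sched_getaffinity(0)))
```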

mrgloom commented 5 years ago

What is the default mode? It looks like num_threads = 4 by default, but I'm not sure. https://github.com/tensorflow/tensorflow/blob/45c3bd5af035508ddadaf114e63ad8a01114d275/tensorflow/lite/kernels/eigen_support.cc#L79 https://github.com/tensorflow/tensorflow/issues/20187

psyhtest commented 5 years ago

Sounds about right. Most high-end mobile chips have 4 big cores, so if the 4 threads get allocated to those, you should get good enough performance.

As I mentioned, tuning the number of threads and how they are pinned to cores (thread affinity) is something we want to do in the future.