Samsung / ONE

On-device Neural Engine

Run PyTorch model with NNAPI of ONERT #5240

Open hyunsik-yoon opened 3 years ago

hyunsik-yoon commented 3 years ago

Last month, the PyTorch team announced that PyTorch models can now run on Android NNAPI. This feature is still a prototype and not an official release.

https://pytorch.org/blog/prototype-features-now-available-apis-for-hardware-accelerated-mobile-and-arm64-builds/#nnapi-support-with-google-android

I will try to run a PyTorch model on NNAPI of ONERT and share the experience.

hyunsik-yoon commented 3 years ago

Preparing a model

$ pip install --upgrade --pre --find-links https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html torch==1.8.0.dev20201106+cpu torchvision==0.9.0.dev20201107+cpu
$ mkdir ~/mobilenetv2-nnapi/
$ python3 pytorch_nnapi_mobilenet.py  # code in the above web site
...
$ ll ~/mobilenetv2-nnapi/
total 44052
drwxrwxr-x  2 eric eric     4096 Dec  2 11:52 ./
drwxr-xr-x 78 eric eric     4096 Dec  2 11:52 ../
-rw-rw-r--  1 eric eric  4206124 Dec  2 11:52 mobilenetv2-quant_core-cpu.pt
-rw-rw-r--  1 eric eric  4195568 Dec  2 11:52 mobilenetv2-quant_core-nnapi.pt
-rw-rw-r--  1 eric eric  3754412 Dec  2 11:52 mobilenetv2-quant_full-cpu.pt
-rw-rw-r--  1 eric eric  3740848 Dec  2 11:52 mobilenetv2-quant_full-nnapi.pt
-rw-rw-r--  1 eric eric 14593598 Dec  2 11:52 mobilenetv2-quant_none-cpu.pt
-rw-rw-r--  1 eric eric 14601862 Dec  2 11:52 mobilenetv2-quant_none-nnapi.pt

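As a quick sanity check (my own arithmetic, not part of the tutorial), the size ratio between the unquantized and fully quantized models is close to the 4x you would expect from int8 vs float32 weights:

```python
# Sizes in bytes, taken from the ls output above.
size_none = 14_593_598  # mobilenetv2-quant_none-nnapi.pt (float32 weights)
size_full = 3_740_848   # mobilenetv2-quant_full-nnapi.pt (int8 weights)

ratio = size_none / size_full
print(f"{ratio:.2f}x smaller")  # → 3.90x smaller
```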
$ file ~/mobilenetv2-nnapi/mobilenetv2-quant_core-nnapi.pt 
/home/eric/mobilenetv2-nnapi/mobilenetv2-quant_core-nnapi.pt: Zip archive data

$ find mobilenetv2-quant_core-nnapi/code/torch/
mobilenetv2-quant_core-nnapi/code/torch/
mobilenetv2-quant_core-nnapi/code/torch/_torch_mangle_2174.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/_torch_mangle_2174.py
mobilenetv2-quant_core-nnapi/code/torch/torch
mobilenetv2-quant_core-nnapi/code/torch/torch/classes
mobilenetv2-quant_core-nnapi/code/torch/torch/classes/_nnapi.py
mobilenetv2-quant_core-nnapi/code/torch/torch/classes/_nnapi.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/torch/backends
mobilenetv2-quant_core-nnapi/code/torch/torch/backends/_nnapi
mobilenetv2-quant_core-nnapi/code/torch/torch/backends/_nnapi/prepare.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/torch/backends/_nnapi/prepare.py
mobilenetv2-quant_core-nnapi/code/torch/torch/nn
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules/container
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules/container/_torch_mangle_2173.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules/container/_torch_mangle_2173.py
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/quantized
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/quantized/modules.py
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/quantized/modules.py.debug_pkl



It seems that the .py files are loaders or wrappers, and the MobileNet model itself is probably serialized inside the pickled files.
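Since `file` reports the .pt as a zip archive, its contents can also be listed with Python's `zipfile` module, without PyTorch installed. A minimal sketch; to keep it self-contained it builds a tiny stand-in archive with a layout like the `find` output above, but on a real model you would open the .pt path directly:

```python
import io
import zipfile

# Stand-in for a TorchScript .pt file, which is an ordinary zip archive.
# For a real model: zipfile.ZipFile("mobilenetv2-quant_core-nnapi.pt")
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("code/__torch__/torch/backends/_nnapi/prepare.py",
               "# TorchScript wrapper code")
    z.writestr("data.pkl", b"")   # pickled module structure
    z.writestr("version", "3\n")  # serialization format version

with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        print(name)
```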
hyunsik-yoon commented 3 years ago

Before testing the NNAPI model generated above, I tried PyTorch Mobile first to get used to its mobile execution environment.

PyTorch Mobile

After running the app installed on a Galaxy S10, the following error occurred:

com.facebook.jni.CppException: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at
 ../caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with 
version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at
 ../caffe2/serialize/inline_container.cc:132)

Looking into build.gradle, it seems we need PyTorch 1.4, so I created a venv for PyTorch 1.4.

$ pip install torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

But the problem remained.

Let's go back to NNAPI and see whether this issue also happens there.
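The version the assert complains about can be read directly from the archive: TorchScript zips carry a `version` entry holding the serialization format version. Another self-contained sketch with a stand-in archive; on a real .pt you would open the file itself:

```python
import io
import zipfile

# Stand-in .pt archive containing only the `version` entry (real archives
# may prefix it with the archive name, hence the endswith() search below).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("version", "3\n")

with zipfile.ZipFile(buf) as z:
    entry = next(n for n in z.namelist() if n.endswith("version"))
    fmt_version = int(z.read(entry).strip())

# A runtime whose maximum supported version is 2 rejects this file,
# which matches the kMaxSupportedFileFormatVersion assert above.
print(fmt_version)  # → 3
```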

hyunsik-yoon commented 3 years ago

Running model for nnapi

On Android device:

y2s:/data/local/tmp/pytorch # ./speed_benchmark_torch --pthreadpool_size=1 --model=mobilenetv2-quant_full-nnapi.pt --use_bundled_input=0 --warmup=5 --iter=200                            
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 25635.7. Iters per second: 39.0081

This runs with the default Android NNAPI implementation.

Now, let's figure out how to switch from the default Android NNAPI to ONERT.
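As a small consistency check of the benchmark output above: the iterations-per-second figure is just the reciprocal of the per-iteration latency.

```python
# Numbers from the speed_benchmark_torch output above.
us_per_iter = 25635.7
iters_per_sec = 1_000_000 / us_per_iter
print(f"{iters_per_sec:.4f}")  # → 39.0081, matching the log
```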

hyunsik-yoon commented 3 years ago

Running pytorch model (for nnapi) on ONERT

134|y2s:/data/local/tmp/pytorch # ONERT_LOG_ENABLE=1 LD_LIBRARY_PATH=../nmt/Product/lib ./speed_benchmark_torch --pthreadpool_size=1 --model=mobilenetv2-quant_full-nnapi.pt --use_bundle>
[EXCEPTION] Conv2D: unsupported input operand count
[NNAPI::Model] addOperation: Fail to add operation
terminating with uncaught exception of type std::runtime_error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 28, in __setstate__
    self.training = False
    self.nnapi_module = nnapi_module
    _0 = (self.nnapi_module).init()
          ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return None
class NnapiModule(Module):
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 105, in init
    comp = __torch__.torch.classes._nnapi.Compilation.__new__(__torch__.torch.classes._nnapi.Compilation)
    _21 = (comp).__init__()
    _22 = (comp).init(self.ser_model, self.weights, )
           ~~~~~~~~~~ <--- HERE
    self.comp = comp
    return None

Traceback of TorchScript, original code (most recent call last):
  File "/home/eric/venv/pytorch-nightly/lib/python3.6/site-packages/torch/backends/_nnapi/prepare.py", line 36, in init
        self.weights = [w.contiguous() for w in self.weights]
        comp = torch.classes._nnapi.Compilation()
        comp.init(self.ser_model, self.weights)
        ~~~~~~~~~ <--- HERE
        self.comp = comp
RuntimeError: [enforce fail at nnapi_model_loader.cpp:233] result == ANEURALNETWORKS_NO_ERROR. 

Aborted (core dumped) 
glistening commented 3 years ago

@hyunsik-yoon I am curious to compare PyTorch NNAPI's performance (Microseconds per iter: 25635.7) with TensorFlow Lite's.

So I did a quick experiment to compare them.

Here is tflite's result on my GS20P:

$ ./benchmark_model --graph=mobilenet_v2_1.0_224_quant.tflite
Inference (avg): 10603.4

$ ./benchmark_model --use_gpu=1 --graph=mobilenet_v2_1.0_224_quant.tflite
Inference (avg): 8461.58

TensorFlow Lite (default) is about 2.4x faster.
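For reference, the 2.4x figure comes from dividing the PyTorch NNAPI latency by the tflite latency (both in microseconds per inference):

```python
# Latencies from the two benchmark outputs above.
pytorch_nnapi_us = 25635.7  # speed_benchmark_torch via NNAPI
tflite_cpu_us = 10603.4     # benchmark_model, default (CPU)
tflite_gpu_us = 8461.58     # benchmark_model, --use_gpu=1

print(f"vs CPU: {pytorch_nnapi_us / tflite_cpu_us:.1f}x")  # → 2.4x
print(f"vs GPU: {pytorch_nnapi_us / tflite_gpu_us:.1f}x")  # → 3.0x
```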

hyunsik-yoon commented 3 years ago

@glistening One thing I'd like to mention is that the whole sequence in this issue follows the instructions at https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.html, where the last step is to run the PyTorch NNAPI benchmark on Android. The purpose is to run any program that runs a PyTorch model on the NNAPI of ONERT, not the benchmark itself.

BTW, the result you mentioned is interesting. 25635.7 was the performance of the pre-loaded NNAPI (not ONERT) on a GS20P (Exynos) device, and it seemed somewhat slow compared to TFLite. :-O

glistening commented 3 years ago

@hyunsik-yoon

One thing I'd like to mention is that the whole sequence in this issue follows the instructions at https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.html, where the last step is to run the PyTorch NNAPI benchmark on Android. The purpose is to run any program that runs a PyTorch model on the NNAPI of ONERT, not the benchmark itself.

I am sorry for bringing in a result beyond the scope of this issue. I just wanted to confirm, with a quick run, my assumption that the PyTorch NNAPI backend would be slower than TensorFlow Lite. I don't want to open another issue since I will not spend more time on this.

BTW, the result you mentioned is interesting. 25635.7 was the performance of the pre-loaded NNAPI (not ONERT) on a GS20P (Exynos) device, and it seemed somewhat slow compared to TFLite. :-O

I thought PyTorch would provide its own NNAPI implementation based on the PyTorch execution engine (or backend). However, it seemingly uses the NNAPI implementation on the Android machine. Then I would bet it is clearly slower than tflite.

hyunsik-yoon commented 3 years ago

@glistening

I thought PyTorch would provide its own NNAPI implementation based on the PyTorch execution engine (or backend). However, it seemingly uses the NNAPI implementation on the Android machine.

Correct.