Samsung / ONE

On-device Neural Engine

Run PyTorch model with NNAPI of ONERT #5240

Open hyunsik-yoon opened 3 years ago

hyunsik-yoon commented 3 years ago

Last month, the PyTorch team announced that PyTorch models can now run on Android NNAPI. This feature is still a prototype and not an official release.

https://pytorch.org/blog/prototype-features-now-available-apis-for-hardware-accelerated-mobile-and-arm64-builds/#nnapi-support-with-google-android

I will try to run a PyTorch model on NNAPI of ONERT and share the experience.

hyunsik-yoon commented 3 years ago

Preparing a model

$ pip install --upgrade --pre --find-links https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html torch==1.8.0.dev20201106+cpu torchvision==0.9.0.dev20201107+cpu
$ mkdir ~/mobilenetv2-nnapi/
$ python3 pytorch_nnapi_mobilenet.py  # code in the above web site
...
$ ll ~/mobilenetv2-nnapi/
total 44052
drwxrwxr-x  2 eric eric     4096 Dec  2 11:52 ./
drwxr-xr-x 78 eric eric     4096 Dec  2 11:52 ../
-rw-rw-r--  1 eric eric  4206124 Dec  2 11:52 mobilenetv2-quant_core-cpu.pt
-rw-rw-r--  1 eric eric  4195568 Dec  2 11:52 mobilenetv2-quant_core-nnapi.pt
-rw-rw-r--  1 eric eric  3754412 Dec  2 11:52 mobilenetv2-quant_full-cpu.pt
-rw-rw-r--  1 eric eric  3740848 Dec  2 11:52 mobilenetv2-quant_full-nnapi.pt
-rw-rw-r--  1 eric eric 14593598 Dec  2 11:52 mobilenetv2-quant_none-cpu.pt
-rw-rw-r--  1 eric eric 14601862 Dec  2 11:52 mobilenetv2-quant_none-nnapi.pt

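As a quick sanity check (my own arithmetic, not part of the tutorial), the size ratio between the unquantized and fully quantized models is close to the 4x you would expect from int8 vs float32 weights:

```python
# Sizes in bytes, taken from the ls output above.
size_none = 14_593_598  # mobilenetv2-quant_none-nnapi.pt (float32 weights)
size_full = 3_740_848   # mobilenetv2-quant_full-nnapi.pt (int8 weights)

ratio = size_none / size_full
print(f"{ratio:.2f}x smaller")  # → 3.90x smaller
```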
$ file ~/mobilenetv2-nnapi/mobilenetv2-quant_core-nnapi.pt 
/home/eric/mobilenetv2-nnapi/mobilenetv2-quant_core-nnapi.pt: Zip archive data

$ find mobilenetv2-quant_core-nnapi/code/torch/
mobilenetv2-quant_core-nnapi/code/torch/
mobilenetv2-quant_core-nnapi/code/torch/_torch_mangle_2174.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/_torch_mangle_2174.py
mobilenetv2-quant_core-nnapi/code/torch/torch
mobilenetv2-quant_core-nnapi/code/torch/torch/classes
mobilenetv2-quant_core-nnapi/code/torch/torch/classes/_nnapi.py
mobilenetv2-quant_core-nnapi/code/torch/torch/classes/_nnapi.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/torch/backends
mobilenetv2-quant_core-nnapi/code/torch/torch/backends/_nnapi
mobilenetv2-quant_core-nnapi/code/torch/torch/backends/_nnapi/prepare.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/torch/backends/_nnapi/prepare.py
mobilenetv2-quant_core-nnapi/code/torch/torch/nn
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules/container
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules/container/_torch_mangle_2173.py.debug_pkl
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/modules/container/_torch_mangle_2173.py
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/quantized
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/quantized/modules.py
mobilenetv2-quant_core-nnapi/code/torch/torch/nn/quantized/modules.py.debug_pkl



It seems that the .py files are loaders or wrappers, and the MobileNet model itself is probably serialized inside the pickled files.
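Since `file` reports the .pt as a zip archive, its contents can also be listed with Python's `zipfile` module, without PyTorch installed. A minimal sketch; to keep it self-contained it builds a tiny stand-in archive with a layout like the `find` output above, but on a real model you would open the .pt path directly:

```python
import io
import zipfile

# Stand-in for a TorchScript .pt file, which is an ordinary zip archive.
# For a real model: zipfile.ZipFile("mobilenetv2-quant_core-nnapi.pt")
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("code/__torch__/torch/backends/_nnapi/prepare.py",
               "# TorchScript wrapper code")
    z.writestr("data.pkl", b"")   # pickled module structure
    z.writestr("version", "3\n")  # serialization format version

with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        print(name)
```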
hyunsik-yoon commented 3 years ago

Before testing the NNAPI model generated above, I tried PyTorch Mobile first to get used to its mobile execution environment.

PyTorch Mobile

After running the app installed on a Galaxy S10, the following error occurred:

com.facebook.jni.CppException: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at
 ../caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with 
version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at
 ../caffe2/serialize/inline_container.cc:132)

Looking into build.gradle, it seems we need PyTorch 1.4, so I created a venv for PyTorch 1.4.

$ pip install torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

But the problem remained.

Let's go back to NNAPI and see whether this issue also happens there.
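The version the assert complains about can be read directly from the archive: TorchScript zips carry a `version` entry holding the serialization format version. Another self-contained sketch with a stand-in archive; on a real .pt you would open the file itself:

```python
import io
import zipfile

# Stand-in .pt archive containing only the `version` entry (real archives
# may prefix it with the archive name, hence the endswith() search below).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("version", "3\n")

with zipfile.ZipFile(buf) as z:
    entry = next(n for n in z.namelist() if n.endswith("version"))
    fmt_version = int(z.read(entry).strip())

# A runtime whose maximum supported version is 2 rejects this file,
# which matches the kMaxSupportedFileFormatVersion assert above.
print(fmt_version)  # → 3
```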

hyunsik-yoon commented 3 years ago

Running model for nnapi

On Android device:

y2s:/data/local/tmp/pytorch # ./speed_benchmark_torch --pthreadpool_size=1 --model=mobilenetv2-quant_full-nnapi.pt --use_bundled_input=0 --warmup=5 --iter=200                            
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 25635.7. Iters per second: 39.0081

This runs with the default Android NNAPI implementation.

Now, let's figure out how to switch from the default Android NNAPI to ONERT.
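As a small consistency check of the benchmark output above: the iterations-per-second figure is just the reciprocal of the per-iteration latency.

```python
# Numbers from the speed_benchmark_torch output above.
us_per_iter = 25635.7
iters_per_sec = 1_000_000 / us_per_iter
print(f"{iters_per_sec:.4f}")  # → 39.0081, matching the log
```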

hyunsik-yoon commented 3 years ago

Running pytorch model (for nnapi) on ONERT

134|y2s:/data/local/tmp/pytorch # ONERT_LOG_ENABLE=1 LD_LIBRARY_PATH=../nmt/Product/lib ./speed_benchmark_torch --pthreadpool_size=1 --model=mobilenetv2-quant_full-nnapi.pt --use_bundle>
[EXCEPTION] Conv2D: unsupported input operand count
[NNAPI::Model] addOperation: Fail to add operation
terminating with uncaught exception of type std::runtime_error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 28, in __setstate__
    self.training = False
    self.nnapi_module = nnapi_module
    _0 = (self.nnapi_module).init()
          ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return None
class NnapiModule(Module):
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 105, in init
    comp = __torch__.torch.classes._nnapi.Compilation.__new__(__torch__.torch.classes._nnapi.Compilation)
    _21 = (comp).__init__()
    _22 = (comp).init(self.ser_model, self.weights, )
           ~~~~~~~~~~ <--- HERE
    self.comp = comp
    return None

Traceback of TorchScript, original code (most recent call last):
  File "/home/eric/venv/pytorch-nightly/lib/python3.6/site-packages/torch/backends/_nnapi/prepare.py", line 36, in init
        self.weights = [w.contiguous() for w in self.weights]
        comp = torch.classes._nnapi.Compilation()
        comp.init(self.ser_model, self.weights)
        ~~~~~~~~~ <--- HERE
        self.comp = comp
RuntimeError: [enforce fail at nnapi_model_loader.cpp:233] result == ANEURALNETWORKS_NO_ERROR. 

Aborted (core dumped) 
glistening commented 3 years ago

@hyunsik-yoon I am curious to compare PyTorch NNAPI's performance (Microseconds per iter: 25635.7) with TensorFlow Lite's.

So I did a quick experiment to compare them.

Here is tflite's result on my GS20P:

$ ./benchmark_model --graph=mobilenet_v2_1.0_224_quant.tflite
Inference (avg): 10603.4

$ ./benchmark_model --use_gpu=1 --graph=mobilenet_v2_1.0_224_quant.tflite
Inference (avg): 8461.58

TensorFlow Lite (default) is about 2.4x faster.
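For reference, the 2.4x figure comes from dividing the PyTorch NNAPI latency by the tflite latency (both in microseconds per inference):

```python
# Latencies from the two benchmark outputs above.
pytorch_nnapi_us = 25635.7  # speed_benchmark_torch via NNAPI
tflite_cpu_us = 10603.4     # benchmark_model, default (CPU)
tflite_gpu_us = 8461.58     # benchmark_model, --use_gpu=1

print(f"vs CPU: {pytorch_nnapi_us / tflite_cpu_us:.1f}x")  # → 2.4x
print(f"vs GPU: {pytorch_nnapi_us / tflite_gpu_us:.1f}x")  # → 3.0x
```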

hyunsik-yoon commented 3 years ago

@glistening One thing I'd like to mention is that the whole sequence in this issue follows the instructions at https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.html, where the last step is to run the PyTorch NNAPI benchmark on Android. The purpose is to run any program that runs a PyTorch model on the NNAPI of ONERT, not the benchmark itself.

BTW, the result you mentioned is interesting. 25635.7 was the performance of the pre-loaded NNAPI (not ONERT) on a GS20P (Exynos) device, and it seemed somewhat slow compared to TFLite. :-O

glistening commented 3 years ago

@hyunsik-yoon

One thing I'd like to mention is that the whole sequence in this issue follows the instructions at https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.html, where the last step is to run the PyTorch NNAPI benchmark on Android. The purpose is to run any program that runs a PyTorch model on the NNAPI of ONERT, not the benchmark itself.

I am sorry for bringing in a result beyond the scope of this issue. I just wanted to confirm, with a quick run, my assumption that the PyTorch NNAPI backend would be slower than TensorFlow Lite. I don't want to open another issue since I will not spend more time on this.

BTW, the result you mentioned is interesting. 25635.7 was the performance of the pre-loaded NNAPI (not ONERT) on a GS20P (Exynos) device, and it seemed somewhat slow compared to TFLite. :-O

I thought PyTorch would provide its own NNAPI implementation based on the PyTorch execution engine (or backend). However, it seemingly uses the NNAPI implementation on the Android machine. Then I would bet it is clearly slower than tflite.

hyunsik-yoon commented 3 years ago

@glistening

I thought PyTorch would provide its own NNAPI implementation based on the PyTorch execution engine (or backend). However, it seemingly uses the NNAPI implementation on the Android machine.

Correct.