Inference is slow on MALI GPUs

gasgallo commented 5 years ago

Before you open an issue, please make sure you have tried the following steps:

Make sure your environment is the same with (https://mace.readthedocs.io/en/latest/installation/env_requirement.html).
Have you ever read the document for your usage?
Check if your issue appears in HOW-TO-DEBUG or FAQ.
The form below must be filled.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
NDK version(e.g., 15c): 18b
GCC version(if compiling for host, e.g., 5.4.0): 5.4.0
MACE version (Use the command: git describe --long --tags): 0.11.0-rc0
Python version(2.7): 2.7
Bazel version (e.g., 0.13.0): 0.16.0

Model deploy file (*.yml)

# The name of library
library_name: FD
target_abis: [arm64-v8a]
target_socs: [rk3399]
model_graph_format: file
model_data_format: file
models:
  RF: # model tag, which will be used in model loading and must be specific.
    platform: caffe
    # path to your tensorflow model's pb file. Support local path, http:// and https://
    model_file_path: /models/model.prototxt
    weight_file_path: /models/model.caffemodel
    # sha256_checksum of your model's pb file.
    # use this command to get the sha256_checksum --> sha256sum path/to/your/pb/file
    model_sha256_checksum: 81c388e812da37e499da8272eff0d7d140e8ae50dcb8d7e124dbd4e98462ad24
    weight_sha256_checksum: 2250beffe1bc13f96f60b95fa37f48848bb31f567ae9eb763c86496a4ae29c9b
    subgraphs:
      - input_tensors:
          - data
        input_shapes:
          - 1,3,640,480
        input_data_formats:
          - NCHW
        output_tensors:
          - face_rpn_cls_prob_stride128
          - face_rpn_bbox_pred_stride128
          - face_rpn_landmark_pred_stride128
          - face_rpn_cls_prob_stride64
          - face_rpn_bbox_pred_stride64
          - face_rpn_landmark_pred_stride64
          - face_rpn_cls_prob_stride32
          - face_rpn_bbox_pred_stride32
          - face_rpn_landmark_pred_stride32
          - face_rpn_cls_prob_stride16
          - face_rpn_bbox_pred_stride16
          - face_rpn_landmark_pred_stride16
          - face_rpn_cls_prob_stride8
          - face_rpn_bbox_pred_stride8
          - face_rpn_landmark_pred_stride8
          - face_rpn_cls_prob_stride4
          - face_rpn_bbox_pred_stride4
          - face_rpn_landmark_pred_stride4
        output_shapes:
          - 1,2,5,5
          - 1,4,5,5
          - 1,10,5,5
          - 1,2,10,10
          - 1,4,10,10
          - 1,10,10,10
          - 1,2,20,20
          - 1,4,20,20
          - 1,10,20,20
          - 1,2,40,40
          - 1,4,40,40
          - 1,10,40,40
          - 1,2,80,80
          - 1,4,80,80
          - 1,10,80,80
          - 1,2,160,160
          - 1,4,160,160
          - 1,10,160,160
        output_data_formats:
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
    obfuscate: 0
    runtime: cpu+gpu # cpu, gpu or cpu+gpu or dsp
    winograd: 4

Describe the problem

Inference time on MALI GPUs is very slow compared to other frameworks and a lot slower than the same model running on Adreno GPUs.

To Reproduce

Steps to reproduce the problem:

1. cd /path/to/mace
2. python tools/converter.py convert --config_file=/path/to/your/model_deployment_file
2. python tools/converter.py benchmark --config_file=/path/to/your/model_deployment_file

Error information / logs

Please include the full log and/or traceback here.

LOGs

Additional context

For example, the model running with the above yml file takes:

328ms on a MALI T864
18ms on Adreno 640
233ms on MALI T864 (using Alibaba/MNN)

lydoc commented 5 years ago

I will check this later. Have you ever benchmarked another model like MobileNet? BTW, which backend of MNN did you use? OpenCL or Vulkan?

gasgallo commented 5 years ago

I will check this later. Have you ever benchmarked another model like MobileNet? BTW, which backend of MNN did you use? OpenCL or Vulkan?

Well, my model has a mobilenet backbone with just some feature extraction layers on the top.

I've tried both Vulkan and OpenCL backend on MNN, but OpenCL is faster in my case, so the time in the initial post is OpenCL one.

gasgallo commented 4 years ago

@lydoc have you had the chance to investigate yet?

lydoc commented 4 years ago

Sorry for the late reply, is it convenient for you to share your model?

gasgallo commented 4 years ago

@lydoc you can grab model files and updated yaml configuration here

tassilo-posegga commented 4 years ago

We are having the same issue:

SM-G960U | 10,31FPS (Samsung Galaxy S9 GLOBAL with Adreno 630) SM-N960F | 5,68FPS (Samsung Galaxy S9 EU with Mali-G72)

We have about half the FPS on Mali GPUs of the same phone model

XiaoMi / mace