XiaoMi / mace

MACE is a deep learning inference framework optimized for mobile heterogeneous computing platforms.
Apache License 2.0

Inference is slow on MALI GPUs #545

Open gasgallo opened 5 years ago

gasgallo commented 5 years ago

Before you open an issue, please make sure you have tried the following steps:

  1. Make sure your environment matches the requirements at https://mace.readthedocs.io/en/latest/installation/env_requirement.html.
  2. Have you ever read the document for your usage?
  3. Check if your issue appears in HOW-TO-DEBUG or FAQ.
  4. The form below must be filled.

System information

Model deploy file (*.yml)

# The name of library
library_name: FD
target_abis: [arm64-v8a]
target_socs: [rk3399]
model_graph_format: file
model_data_format: file
models:
  RF: # model tag, which will be used in model loading and must be specific.
    platform: caffe
    # path to your Caffe model's prototxt file. Supports local path, http:// and https://
    model_file_path: /models/model.prototxt
    weight_file_path: /models/model.caffemodel
    # sha256 checksums of the model and weight files.
    # use this command to get a sha256 checksum --> sha256sum path/to/your/file
    model_sha256_checksum: 81c388e812da37e499da8272eff0d7d140e8ae50dcb8d7e124dbd4e98462ad24
    weight_sha256_checksum: 2250beffe1bc13f96f60b95fa37f48848bb31f567ae9eb763c86496a4ae29c9b
    subgraphs:
      - input_tensors:
          - data
        input_shapes:
          - 1,3,640,480
        input_data_formats:
          - NCHW
        output_tensors:
          - face_rpn_cls_prob_stride128
          - face_rpn_bbox_pred_stride128
          - face_rpn_landmark_pred_stride128
          - face_rpn_cls_prob_stride64
          - face_rpn_bbox_pred_stride64
          - face_rpn_landmark_pred_stride64
          - face_rpn_cls_prob_stride32
          - face_rpn_bbox_pred_stride32
          - face_rpn_landmark_pred_stride32
          - face_rpn_cls_prob_stride16
          - face_rpn_bbox_pred_stride16
          - face_rpn_landmark_pred_stride16
          - face_rpn_cls_prob_stride8
          - face_rpn_bbox_pred_stride8
          - face_rpn_landmark_pred_stride8
          - face_rpn_cls_prob_stride4
          - face_rpn_bbox_pred_stride4
          - face_rpn_landmark_pred_stride4
        output_shapes:
          - 1,2,5,5
          - 1,4,5,5
          - 1,10,5,5
          - 1,2,10,10
          - 1,4,10,10
          - 1,10,10,10
          - 1,2,20,20
          - 1,4,20,20
          - 1,10,20,20
          - 1,2,40,40
          - 1,4,40,40
          - 1,10,40,40
          - 1,2,80,80
          - 1,4,80,80
          - 1,10,80,80
          - 1,2,160,160
          - 1,4,160,160
          - 1,10,160,160
        output_data_formats:
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
    obfuscate: 0
    runtime: cpu+gpu # cpu, gpu or cpu+gpu or dsp
    winograd: 4
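
A quick arithmetic check of the shapes declared above may be worth running before benchmarking. The sketch below rests on two assumptions that the issue does not state: that the number in each `strideN` tensor name is the total downsampling factor, and that spatial sizes round up. `expected_hw` and `find_mismatches` are hypothetical helper names, not part of MACE. Under those assumptions, the square output shapes in the file would correspond to a 640x640 input rather than the configured 640x480.

```python
import math

# Values copied from the deployment file above.
input_shape = (1, 3, 640, 480)  # NCHW
declared_hw = {
    128: (5, 5),
    64: (10, 10),
    32: (20, 20),
    16: (40, 40),
    8: (80, 80),
    4: (160, 160),
}


def expected_hw(height, width, stride):
    """Spatial size after downsampling by `stride` (assumption: sizes round up)."""
    return math.ceil(height / stride), math.ceil(width / stride)


def find_mismatches(input_shape, declared_hw):
    """Map stride -> (declared, expected) for every head whose H x W disagrees."""
    _, _, height, width = input_shape
    mismatches = {}
    for stride, declared in declared_hw.items():
        expected = expected_hw(height, width, stride)
        if expected != declared:
            mismatches[stride] = (declared, expected)
    return mismatches


mismatches = find_mismatches(input_shape, declared_hw)
for stride, (declared, expected) in sorted(mismatches.items()):
    print(f"stride {stride}: declared {declared}, expected {expected}")
```

This only checks arithmetic; whether MACE enforces the declared `output_shapes` at conversion time is not something the thread establishes.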

Describe the problem

Inference time on MALI GPUs is very slow compared to other frameworks and a lot slower than the same model running on Adreno GPUs.

To Reproduce

Steps to reproduce the problem:

1. cd /path/to/mace
2. python tools/converter.py convert --config_file=/path/to/your/model_deployment_file
3. python tools/converter.py benchmark --config_file=/path/to/your/model_deployment_file
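
For repeated runs, the steps above could be wrapped in a small script. This is a minimal sketch that only uses the two `converter.py` subcommands shown in the steps; `converter_commands` and `run_steps` are hypothetical helper names, and the paths are the same placeholders used above.

```python
import shlex
import subprocess


def converter_commands(config_file):
    """Return the convert and benchmark invocations as argv lists."""
    return [
        ["python", "tools/converter.py", step, f"--config_file={config_file}"]
        for step in ("convert", "benchmark")
    ]


def run_steps(config_file, mace_root, runner=subprocess.run):
    """Run both steps from inside the MACE checkout, stopping on the first failure."""
    for argv in converter_commands(config_file):
        runner(argv, cwd=mace_root, check=True)


commands = converter_commands("/path/to/your/model_deployment_file")
for argv in commands:
    # Dry run: print each command line instead of executing it.
    print(shlex.join(argv))
```

Calling `run_steps("/path/to/your/model_deployment_file", "/path/to/mace")` would execute both commands with the working directory set to the checkout, mirroring the `cd` in step 1.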

Error information / logs

Please include the full log and/or traceback here.

LOGs

Additional context

For example, the model running with the above yml file takes:

lydoc commented 5 years ago

I will check this later. Have you ever benchmarked another model like MobileNet? BTW, which backend of MNN did you use? OpenCL or Vulkan?

gasgallo commented 5 years ago

> I will check this later. Have you ever benchmarked another model like MobileNet? BTW, which backend of MNN did you use? OpenCL or Vulkan?

Well, my model has a MobileNet backbone with just a few feature-extraction layers on top.

I've tried both the Vulkan and OpenCL backends on MNN. OpenCL is faster in my case, so the time in the initial post is the OpenCL one.

gasgallo commented 4 years ago

@lydoc have you had the chance to investigate yet?

lydoc commented 4 years ago

Sorry for the late reply. Would you be able to share your model?

gasgallo commented 4 years ago

@lydoc you can grab model files and updated yaml configuration here

tassilo-posegga commented 4 years ago

We are having the same issue:

SM-G960U | 10.31 FPS (Samsung Galaxy S9 GLOBAL with Adreno 630)
SM-N960F | 5.68 FPS (Samsung Galaxy S9 EU with Mali-G72)

On the Mali GPU variant of the same phone model we get about half the FPS.
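
To put the two figures on a common footing, the frame rates can be converted to per-frame latency. This is a small sketch using only the two numbers reported above; `frame_latency_ms` is a hypothetical helper name. It works out to a Mali/Adreno throughput ratio of roughly 0.55, or about 97 ms versus 176 ms per frame.

```python
def frame_latency_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


adreno_fps = 10.31  # SM-G960U, Adreno 630 (figure from the comment above)
mali_fps = 5.68     # SM-N960F, Mali-G72 (figure from the comment above)

ratio = mali_fps / adreno_fps
print(f"Mali/Adreno throughput ratio: {ratio:.2f}")
print(f"Adreno: {frame_latency_ms(adreno_fps):.0f} ms per frame")
print(f"Mali:   {frame_latency_ms(mali_fps):.0f} ms per frame")
```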