Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

All int8 models fall back to CPU when running benchncnn #2170

Closed: baryluk closed 3 years ago

baryluk commented 4 years ago

AMD Radeon R9 Fury X (FIJI):

GPU Mesa 20.1.8 (with LLVM compiler):

$ ../build/benchmark/benchncnn 201 32 0 0 
[0 AMD RADV FIJI (LLVM 10.0.1)]  queueC=1[4]  queueG=0[1]  queueT=0[1]
[0 AMD RADV FIJI (LLVM 10.0.1)]  bugsbn1=0  buglbia=0  bugcopc=0  bugihfa=0
[0 AMD RADV FIJI (LLVM 10.0.1)]  fp16p=1  fp16s=1  fp16a=0  int8s=1  int8a=1
loop_count = 201
num_threads = 32
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet        max =  297.89/s  median =  262.67/s
     squeezenet_int8        max =   50.97/s  median =   49.12/s  CPU
           mobilenet        max =  234.63/s  median =  218.82/s
      mobilenet_int8        max =   42.09/s  median =   40.46/s  CPU
        mobilenet_v2        max =  151.54/s  median =  144.13/s
        mobilenet_v3        max =  110.36/s  median =  104.60/s
          shufflenet        max =  334.34/s  median =  296.55/s
       shufflenet_v2        max =  213.27/s  median =  199.68/s
             mnasnet        max =  143.14/s  median =  134.66/s
     proxylessnasnet        max =  143.93/s  median =  132.36/s
     efficientnet_b0        max =   54.63/s  median =   51.96/s
        regnety_400m        max =   97.26/s  median =   86.96/s
           blazeface        max =  666.67/s  median =  642.61/s
           googlenet        max =   61.35/s  median =   58.17/s
      googlenet_int8        max =   17.88/s  median =   17.10/s  CPU
            resnet18        max =  130.17/s  median =  125.06/s
       resnet18_int8        max =   39.32/s  median =   38.47/s  CPU
             alexnet        max =   78.43/s  median =   76.69/s
               vgg16        max =   18.18/s  median =   18.00/s
          vgg16_int8        max =    8.95/s  median =    8.71/s  CPU
            resnet50        max =   50.62/s  median =   48.82/s
       resnet50_int8        max =   11.71/s  median =   11.49/s  CPU
      squeezenet_ssd        max =   71.59/s  median =   68.67/s
 squeezenet_ssd_int8        max =   22.85/s  median =   21.93/s  CPU
       mobilenet_ssd        max =  106.66/s  median =  103.20/s
  mobilenet_ssd_int8        max =   21.76/s  median =   21.14/s  CPU
      mobilenet_yolo        max =   93.14/s  median =   89.78/s
  mobilenetv2_yolov3        max =   86.54/s  median =   83.26/s
         yolov4-tiny        max =   59.21/s  median =   56.48/s
$

Mesa 20.3+ (with ACO compiler)

$ ../build/benchmark/benchncnn 201 32 0 0 
[0 AMD RADV FIJI (ACO)]  queueC=1[4]  queueG=0[1]  queueT=0[1]
[0 AMD RADV FIJI (ACO)]  bugsbn1=0  buglbia=0  bugcopc=0  bugihfa=0
[0 AMD RADV FIJI (ACO)]  fp16p=1  fp16s=1  fp16a=0  int8s=1  int8a=1
loop_count = 201
num_threads = 32
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet        max =  499.51/s  median =  458.27/s
     squeezenet_int8        max =   50.95/s  median =   48.80/s  CPU
           mobilenet        max =  416.51/s  median =  369.68/s
      mobilenet_int8        max =   43.15/s  median =   41.29/s  CPU
        mobilenet_v2        max =  275.34/s  median =  249.94/s
        mobilenet_v3        max =  192.72/s  median =  166.83/s
          shufflenet        max =  469.72/s  median =  425.91/s
       shufflenet_v2        max =  313.46/s  median =  264.14/s
             mnasnet        max =  255.30/s  median =  233.75/s
     proxylessnasnet        max =  259.54/s  median =  238.49/s
     efficientnet_b0        max =   74.33/s  median =   69.06/s
        regnety_400m        max =  139.80/s  median =  121.52/s
           blazeface        max =  832.69/s  median =  797.35/s
           googlenet        max =  122.01/s  median =  111.66/s
      googlenet_int8        max =   17.70/s  median =   16.98/s  CPU
            resnet18        max =  337.15/s  median =  298.85/s
       resnet18_int8        max =   39.29/s  median =   38.19/s  CPU
             alexnet        max =  159.01/s  median =  150.24/s
               vgg16        max =   68.10/s  median =   66.24/s
          vgg16_int8        max =    9.23/s  median =    8.83/s  CPU
            resnet50        max =  113.25/s  median =  106.73/s
       resnet50_int8        max =   11.73/s  median =   11.44/s  CPU
      squeezenet_ssd        max =  148.79/s  median =  135.11/s
 squeezenet_ssd_int8        max =   23.30/s  median =   22.48/s  CPU
       mobilenet_ssd        max =  188.82/s  median =  177.40/s
  mobilenet_ssd_int8        max =   21.83/s  median =   21.21/s  CPU
      mobilenet_yolo        max =  216.93/s  median =  200.84/s
  mobilenetv2_yolov3        max =  133.31/s  median =  125.02/s
         yolov4-tiny        max =  112.22/s  median =  105.74/s
$

The device and driver support int8 storage and arithmetic on the Fury X (int8s=1 and int8a=1 in the output above), so I am not sure why ncnn falls back to the CPU for int8 models.

CPU usage is 100% while the int8 models run, and their throughput matches the numbers from a CPU-only benchmark:

$ ../build/benchmark/benchncnn 201 32
loop_count = 201
num_threads = 32
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet        max =  133.00/s  median =  129.05/s  CPU
     squeezenet_int8        max =   51.18/s  median =   49.35/s  CPU
           mobilenet        max =  147.86/s  median =  143.91/s  CPU
      mobilenet_int8        max =   42.67/s  median =   41.62/s  CPU
        mobilenet_v2        max =  102.76/s  median =  100.29/s  CPU
        mobilenet_v3        max =  108.28/s  median =  107.31/s  CPU
          shufflenet        max =   85.21/s  median =   84.40/s  CPU
       shufflenet_v2        max =  107.72/s  median =  106.89/s  CPU
             mnasnet        max =  106.94/s  median =  105.26/s  CPU
     proxylessnasnet        max =   98.96/s  median =   94.06/s  CPU
     efficientnet_b0        max =   78.64/s  median =   76.64/s  CPU
        regnety_400m        max =   19.10/s  median =   18.96/s  CPU
           blazeface        max =  287.02/s  median =  280.49/s  CPU
           googlenet        max =   41.67/s  median =   40.75/s  CPU
      googlenet_int8        max =   17.83/s  median =   17.06/s  CPU
            resnet18        max =   47.94/s  median =   45.29/s  CPU
       resnet18_int8        max =   38.97/s  median =   38.04/s  CPU
             alexnet        max =   58.06/s  median =   57.09/s  CPU
               vgg16        max =   15.70/s  median =   15.25/s  CPU
          vgg16_int8        max =    8.40/s  median =    8.24/s  CPU
            resnet50        max =   31.17/s  median =   29.46/s  CPU
       resnet50_int8        max =   11.76/s  median =   11.52/s  CPU
      squeezenet_ssd        max =   41.65/s  median =   40.48/s  CPU
 squeezenet_ssd_int8        max =   23.49/s  median =   22.49/s  CPU
       mobilenet_ssd        max =   68.86/s  median =   66.57/s  CPU
  mobilenet_ssd_int8        max =   21.53/s  median =   21.03/s  CPU
      mobilenet_yolo        max =   38.25/s  median =   37.10/s  CPU
  mobilenetv2_yolov3        max =   38.79/s  median =   37.85/s  CPU
         yolov4-tiny        max =   27.09/s  median =   25.83/s  CPU
$
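
The comparison above can also be checked mechanically. The following is a minimal sketch, not part of ncnn: it assumes the result-line layout shown in the logs above (`name  max = X/s  median = Y/s  [CPU]`), and the names `parse_bench`, `cpu_fallbacks` and the 5% tolerance are hypothetical choices made for illustration. It lists the models that the GPU run marks as `CPU` and whose median throughput agrees with the CPU-only run.

```python
import re

# Assumed layout of a benchncnn result line, taken from the logs in this issue:
#   "squeezenet_int8  max =  50.97/s  median =  49.12/s  CPU"
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+max\s*=\s*(?P<max>[\d.]+)/s\s+"
    r"median\s*=\s*(?P<median>[\d.]+)/s(?P<cpu>\s+CPU)?\s*$"
)


def parse_bench(text):
    """Return {model: (median_per_s, marked_cpu)} for every result line."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:  # header lines such as "loop_count = 201" do not match
            results[m.group("name")] = (
                float(m.group("median")),
                m.group("cpu") is not None,
            )
    return results


def cpu_fallbacks(gpu_text, cpu_text, tolerance=0.05):
    """Models marked CPU in the GPU run whose median is within
    `tolerance` (relative) of the CPU-only run. Hypothetical helper."""
    gpu, cpu = parse_bench(gpu_text), parse_bench(cpu_text)
    matches = []
    for name, (median, marked_cpu) in gpu.items():
        if marked_cpu and name in cpu:
            reference = cpu[name][0]
            if abs(median - reference) / reference <= tolerance:
                matches.append(name)
    return matches


# Two lines copied from the ACO run and the CPU-only run above.
gpu_run = """
          squeezenet        max =  499.51/s  median =  458.27/s
     squeezenet_int8        max =   50.95/s  median =   48.80/s  CPU
"""
cpu_run = """
          squeezenet        max =  133.00/s  median =  129.05/s  CPU
     squeezenet_int8        max =   51.18/s  median =   49.35/s  CPU
"""

print(cpu_fallbacks(gpu_run, cpu_run))  # prints ['squeezenet_int8']
```

Feeding the full logs into `cpu_fallbacks` instead of the two-line excerpts would apply the same check to every int8 model.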

nihui commented 4 years ago

Yes, this is intentional. int8 quantized inference is currently not implemented on Vulkan, so it falls back to the CPU.
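
The behavior described here can be pictured with a small decision function. This is an illustrative sketch only and is not ncnn's actual code: `pick_backend` and its parameters are hypothetical, and it models the choice per model for simplicity, since the thread does not say at which granularity the fallback happens. The only facts taken from the thread are that `gpu_device = -1` means a CPU-only run and that int8 inference has no Vulkan implementation at the time of this comment.

```python
def pick_backend(model_is_int8, gpu_device, vulkan_has_int8=False):
    """Hypothetical model of the fallback rule described in this thread.

    gpu_device      -- device index as printed by benchncnn; -1 means CPU only
    vulkan_has_int8 -- False reflects the state reported in this comment
    """
    if gpu_device < 0:
        return "CPU"  # no GPU was requested
    if model_is_int8 and not vulkan_has_int8:
        return "CPU"  # int8 path missing on Vulkan, so fall back
    return "Vulkan"


print(pick_backend(model_is_int8=False, gpu_device=0))   # prints Vulkan
print(pick_backend(model_is_int8=True, gpu_device=0))    # prints CPU
print(pick_backend(model_is_int8=False, gpu_device=-1))  # prints CPU
```

Note that the hardware flags `int8s=1` and `int8a=1` do not appear in this rule: the device being capable of int8 is not sufficient when the framework has no int8 implementation for that backend.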