GPU Mesa 20.1.8 (with LLVM compiler):
$ ../build/benchmark/benchncnn 201 32 0 0
[0 AMD RADV FIJI (LLVM 10.0.1)] queueC=1[4] queueG=0[1] queueT=0[1]
[0 AMD RADV FIJI (LLVM 10.0.1)] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 AMD RADV FIJI (LLVM 10.0.1)] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
loop_count = 201
num_threads = 32
powersave = 0
gpu_device = 0
cooling_down = 1
squeezenet max = 297.89/s median = 262.67/s
squeezenet_int8 max = 50.97/s median = 49.12/s CPU
mobilenet max = 234.63/s median = 218.82/s
mobilenet_int8 max = 42.09/s median = 40.46/s CPU
mobilenet_v2 max = 151.54/s median = 144.13/s
mobilenet_v3 max = 110.36/s median = 104.60/s
shufflenet max = 334.34/s median = 296.55/s
shufflenet_v2 max = 213.27/s median = 199.68/s
mnasnet max = 143.14/s median = 134.66/s
proxylessnasnet max = 143.93/s median = 132.36/s
efficientnet_b0 max = 54.63/s median = 51.96/s
regnety_400m max = 97.26/s median = 86.96/s
blazeface max = 666.67/s median = 642.61/s
googlenet max = 61.35/s median = 58.17/s
googlenet_int8 max = 17.88/s median = 17.10/s CPU
resnet18 max = 130.17/s median = 125.06/s
resnet18_int8 max = 39.32/s median = 38.47/s CPU
alexnet max = 78.43/s median = 76.69/s
vgg16 max = 18.18/s median = 18.00/s
vgg16_int8 max = 8.95/s median = 8.71/s CPU
resnet50 max = 50.62/s median = 48.82/s
resnet50_int8 max = 11.71/s median = 11.49/s CPU
squeezenet_ssd max = 71.59/s median = 68.67/s
squeezenet_ssd_int8 max = 22.85/s median = 21.93/s CPU
mobilenet_ssd max = 106.66/s median = 103.20/s
mobilenet_ssd_int8 max = 21.76/s median = 21.14/s CPU
mobilenet_yolo max = 93.14/s median = 89.78/s
mobilenetv2_yolov3 max = 86.54/s median = 83.26/s
yolov4-tiny max = 59.21/s median = 56.48/s
$
Mesa 20.3+ (with ACO compiler):
$ ../build/benchmark/benchncnn 201 32 0 0
[0 AMD RADV FIJI (ACO)] queueC=1[4] queueG=0[1] queueT=0[1]
[0 AMD RADV FIJI (ACO)] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 AMD RADV FIJI (ACO)] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
loop_count = 201
num_threads = 32
powersave = 0
gpu_device = 0
cooling_down = 1
squeezenet max = 499.51/s median = 458.27/s
squeezenet_int8 max = 50.95/s median = 48.80/s CPU
mobilenet max = 416.51/s median = 369.68/s
mobilenet_int8 max = 43.15/s median = 41.29/s CPU
mobilenet_v2 max = 275.34/s median = 249.94/s
mobilenet_v3 max = 192.72/s median = 166.83/s
shufflenet max = 469.72/s median = 425.91/s
shufflenet_v2 max = 313.46/s median = 264.14/s
mnasnet max = 255.30/s median = 233.75/s
proxylessnasnet max = 259.54/s median = 238.49/s
efficientnet_b0 max = 74.33/s median = 69.06/s
regnety_400m max = 139.80/s median = 121.52/s
blazeface max = 832.69/s median = 797.35/s
googlenet max = 122.01/s median = 111.66/s
googlenet_int8 max = 17.70/s median = 16.98/s CPU
resnet18 max = 337.15/s median = 298.85/s
resnet18_int8 max = 39.29/s median = 38.19/s CPU
alexnet max = 159.01/s median = 150.24/s
vgg16 max = 68.10/s median = 66.24/s
vgg16_int8 max = 9.23/s median = 8.83/s CPU
resnet50 max = 113.25/s median = 106.73/s
resnet50_int8 max = 11.73/s median = 11.44/s CPU
squeezenet_ssd max = 148.79/s median = 135.11/s
squeezenet_ssd_int8 max = 23.30/s median = 22.48/s CPU
mobilenet_ssd max = 188.82/s median = 177.40/s
mobilenet_ssd_int8 max = 21.83/s median = 21.21/s CPU
mobilenet_yolo max = 216.93/s median = 200.84/s
mobilenetv2_yolov3 max = 133.31/s median = 125.02/s
yolov4-tiny max = 112.22/s median = 105.74/s
$
The device and driver support int8 storage and arithmetic on the Fury X (the logs above report int8s=1 int8a=1), so I am not sure why ncnn falls back to the CPU for int8 models.
CPU usage is 100% when running the int8 models, and their numbers are consistent with those from a CPU-only benchmark:
$ ../build/benchmark/benchncnn 201 32
loop_count = 201
num_threads = 32
powersave = 0
gpu_device = -1
cooling_down = 1
squeezenet max = 133.00/s median = 129.05/s CPU
squeezenet_int8 max = 51.18/s median = 49.35/s CPU
mobilenet max = 147.86/s median = 143.91/s CPU
mobilenet_int8 max = 42.67/s median = 41.62/s CPU
mobilenet_v2 max = 102.76/s median = 100.29/s CPU
mobilenet_v3 max = 108.28/s median = 107.31/s CPU
shufflenet max = 85.21/s median = 84.40/s CPU
shufflenet_v2 max = 107.72/s median = 106.89/s CPU
mnasnet max = 106.94/s median = 105.26/s CPU
proxylessnasnet max = 98.96/s median = 94.06/s CPU
efficientnet_b0 max = 78.64/s median = 76.64/s CPU
regnety_400m max = 19.10/s median = 18.96/s CPU
blazeface max = 287.02/s median = 280.49/s CPU
googlenet max = 41.67/s median = 40.75/s CPU
googlenet_int8 max = 17.83/s median = 17.06/s CPU
resnet18 max = 47.94/s median = 45.29/s CPU
resnet18_int8 max = 38.97/s median = 38.04/s CPU
alexnet max = 58.06/s median = 57.09/s CPU
vgg16 max = 15.70/s median = 15.25/s CPU
vgg16_int8 max = 8.40/s median = 8.24/s CPU
resnet50 max = 31.17/s median = 29.46/s CPU
resnet50_int8 max = 11.76/s median = 11.52/s CPU
squeezenet_ssd max = 41.65/s median = 40.48/s CPU
squeezenet_ssd_int8 max = 23.49/s median = 22.49/s CPU
mobilenet_ssd max = 68.86/s median = 66.57/s CPU
mobilenet_ssd_int8 max = 21.53/s median = 21.03/s CPU
mobilenet_yolo max = 38.25/s median = 37.10/s CPU
mobilenetv2_yolov3 max = 38.79/s median = 37.85/s CPU
yolov4-tiny max = 27.09/s median = 25.83/s CPU
$
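To make the "consistent with CPU-only" observation concrete, here is a minimal Python sketch that parses result lines in the format pasted above and compares median throughput between the GPU run and the CPU-only run. The names `parse_benchmark` and `relative_diff` are hypothetical helpers written for this post, not part of ncnn, and the regex assumes the exact `max = …/s median = …/s [CPU]` layout shown in these logs. Only a few lines from the ACO and CPU-only logs are included as sample input.

```python
import re

# Matches result lines in the format shown in this post, e.g.
#   "squeezenet_int8 max = 50.95/s median = 48.80/s CPU"
# (assumption: the trailing "CPU" tag marks models that ran on the CPU).
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+max\s*=\s*(?P<max>[\d.]+)/s\s+"
    r"median\s*=\s*(?P<median>[\d.]+)/s(?P<cpu>\s+CPU)?\s*$"
)


def parse_benchmark(text):
    """Return {model: (median_per_second, ran_on_cpu)} for each result line."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            results[m.group("name")] = (
                float(m.group("median")),
                m.group("cpu") is not None,
            )
    return results


def relative_diff(a, b):
    """Absolute difference between a and b, relative to b."""
    return abs(a - b) / b


# Excerpts copied from the Mesa 20.3+ (ACO) log and the CPU-only log above.
gpu_run = """\
squeezenet max = 499.51/s median = 458.27/s
squeezenet_int8 max = 50.95/s median = 48.80/s CPU
resnet18 max = 337.15/s median = 298.85/s
resnet18_int8 max = 39.29/s median = 38.19/s CPU
"""
cpu_run = """\
squeezenet max = 133.00/s median = 129.05/s CPU
squeezenet_int8 max = 51.18/s median = 49.35/s CPU
resnet18 max = 47.94/s median = 45.29/s CPU
resnet18_int8 max = 38.97/s median = 38.04/s CPU
"""

gpu = parse_benchmark(gpu_run)
cpu = parse_benchmark(cpu_run)

for name, (gpu_median, on_cpu) in gpu.items():
    cpu_median, _ = cpu[name]
    tag = "CPU" if on_cpu else "GPU"
    print(
        f"{name:18s} {tag}  gpu_run={gpu_median:8.2f}/s  "
        f"cpu_run={cpu_median:8.2f}/s  diff={relative_diff(gpu_median, cpu_median):.1%}"
    )
```

For the int8 models the two runs differ by only about a percent, while the non-int8 models differ by a large factor, which matches the CPU fallback described above.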
AMD Radeon R9 Fury X (FIJI):