Open dog-qiuqiu opened 4 years ago
@dog-qiuqiu Thanks!
Can you test and compare MobileNetV2-YOLOv3-Lite vs yolov3-tiny.cfg vs yolov3-tiny-prn.cfg vs yolov4.cfg, since they are already supported by NCNN?
It seems that tiny-prn is faster on GPU than tiny, while tiny is faster on NPU than tiny-prn.
And yolov4-tiny.cfg, once it is implemented in NCNN: https://github.com/Tencent/ncnn/issues/1885
Also, you can try to optimize yolov4-tiny.cfg for mobile CPU.
@AlexeyAB Hi, this is the result of the NCNN test on Huawei's Kirin 990, using 4 high-performance cores:
loop_count = 1
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 31.58  max = 31.58  avg = 31.58
yolov3-tiny-prn  min = 36.60  max = 36.60  avg = 36.60
yolov3-tiny  min = 51.36  max = 51.36  avg = 51.36
yolov4  min = 733.67  max = 733.67  avg = 733.67
NCNN does not seem to support yolov4-tiny.
Thanks!
> yolov4-tiny NCNN Does not seem to support
It was implemented 2 hours ago: https://github.com/Tencent/ncnn/commit/0bc45eed525531d8f0b6d1991bfd55f6e9428410
> loop_count = 1 num_threads = 4 powersave = 0 gpu_device = -1 cooling_down = 0
Did you try gpu_device = 0?
OK!
loop_count = 1
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 33.14  max = 33.14  avg = 33.14
yolov3-tiny-prn  min = 37.15  max = 37.15  avg = 37.15
yolov3-tiny  min = 58.39  max = 58.39  avg = 58.39
yolov4  min = 781.29  max = 781.29  avg = 781.29
As far as I know, the Mali GPU has no efficiency advantage over the ARM CPU, at least on my Kirin 990, but Qualcomm GPUs may give some improvement. You could try the arm82 backend of MNN; in theory, it should be twice as fast as NCNN without arm82.
Yes, it seems this GPU doesn't improve speed. Try yolov4-tiny.
YOLOV4-TINY:
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 35.15  max = 35.65  avg = 35.43
yolov3-tiny-prn  min = 38.83  max = 39.16  avg = 38.96
yolov3-tiny  min = 52.38  max = 53.01  avg = 52.74
yolov4-tiny  min = 51.23  max = 51.64  avg = 51.42
yolov4  min = 779.41  max = 791.94  avg = 785.52
@dog-qiuqiu Thanks! So this is about 20 FPS at 40.2% AP50 on COCO for yolov4-tiny.cfg on the Kirin 990 ARM CPU (Huawei P40).
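The FPS figure can be checked against the average latencies in the benchmark above. This is a minimal sketch, assuming the NCNN benchmark timings are milliseconds per inference; the helper name `latency_ms_to_fps` is only illustrative and is not part of NCNN.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert an average per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Average latencies copied from the 4-thread CPU run (gpu_device = -1, loop_count = 4).
# Assumption: the values are milliseconds per inference.
avg_ms = {
    "MobileNetV2-YOLOv3-Lite-coco": 35.43,
    "yolov3-tiny-prn": 38.96,
    "yolov3-tiny": 52.74,
    "yolov4-tiny": 51.42,
    "yolov4": 785.52,
}

fps = {name: latency_ms_to_fps(ms) for name, ms in avg_ms.items()}

for name, value in fps.items():
    print(f"{name}: {value:.1f} FPS")
```

Under that assumption, yolov4-tiny comes out at roughly 19.4 FPS, in line with the approximate 20 FPS quoted.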
So you can try to improve yolov4-tiny in the same way as MobileNetV2-YOLOv3-Lite/Nano/Fastest. Or just add groups= to the [conv] layers, and maybe SE blocks.
https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3
Darknet Group convolution is not well supported on some GPUs such as NVIDIA PASCAL!!! The MobileNetV2-YOLOv3-SPP inference time is 100ms at GTX1080ti, but RTX2080 inference time is 5ms!!!
I think such a big difference (100 ms vs 5 ms) is due to different cuDNN versions or something else (for example, one build compiled with CUDNN=1 and the other with CUDNN=0).
Also, about groups=: Tensor Cores on Volta/RTX are used only if the conv layer has no groups parameter (or groups=1), so for groups>1 the regular CUDA cores (shaders) are used, at about the same speed: https://github.com/AlexeyAB/darknet/blob/320e6fd8d29f6f7825ef668f15f955f90131f782/src/convolutional_kernels.cu#L423-L424
Darknet, TF, PyTorch, etc. all use the same grouped-convolution implementation from the cuDNN library.
I will try to improve yolov4-tiny with depthwise separable convolutions. Thank you for your work!
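To see why groups= and depthwise separable convolutions are attractive for mobile CPUs, it helps to count the weights of a single layer. This is a rough sketch: the layer sizes (128 input channels, 256 output channels, 3x3 kernel) are illustrative assumptions and are not taken from yolov4-tiny.cfg, and biases and batch-norm parameters are ignored.

```python
def conv_weights(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Weight count of a 2D convolution: each output channel sees c_in / groups inputs."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    return (c_in // groups) * c_out * kernel * kernel


def depthwise_separable_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Depthwise k x k convolution (groups = c_in) followed by a 1x1 pointwise convolution."""
    depthwise = conv_weights(c_in, c_in, kernel, groups=c_in)
    pointwise = conv_weights(c_in, c_out, 1)
    return depthwise + pointwise


# Illustrative layer sizes (assumption, not read from any cfg file).
c_in, c_out, k = 128, 256, 3

standard = conv_weights(c_in, c_out, k)                  # regular convolution
grouped = conv_weights(c_in, c_out, k, groups=4)         # grouped convolution, groups=4
separable = depthwise_separable_weights(c_in, c_out, k)  # depthwise + pointwise

print(f"standard:  {standard}")
print(f"groups=4:  {grouped}")
print(f"separable: {separable}")
```

For these sizes the separable variant needs about 8.7 times fewer weights than the regular convolution. As the discussion above points out, fewer weights do not automatically mean faster inference on every GPU, so the gain has to be measured per device.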
@dog-qiuqiu Hi, did you try to test yolov4-tiny.cfg and MobileNetV2-YOLOv3-Lite-coco on Raspberry Pi 3 / 4?
@AlexeyAB Okay, I have a Raspberry Pi 3B. I will run the timing benchmark on it.
@dog-qiuqiu
> I will try to improve yolov4-tiny with depthwise separable convolutions, Thank you for your work!!!
> Okay, I have a Raspberry Pi 3b I will test the time-consuming benchmark
Hi, did you have any success with it?
@AlexeyAB Sorry, my Raspberry Pi 3 is missing an SD card. I plan to buy one on Saturday so that I can run the Raspberry Pi 3 benchmark. In the meantime, I can now run MobileNetV2-YOLOv3-Nano on Android in real time, and I plan to port yolov4-tiny to Android to run in real time as well. This is the Android project: https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3#ncnn-android-sample
@AlexeyAB Hi, this is a real-time detection Android project based on ncnn's yolov4-tiny: https://github.com/dog-qiuqiu/Android_NCNN_yolov4-tiny
@dog-qiuqiu Nice!
It seems a Raspberry Pi 4 (4 threads) can process yolov4-tiny (int8, 416x416) at about 4 FPS using TFLite: https://github.com/PINTO0309/PINTO_model_zoo#3-tflite-model-benchmark
RaspberryPi4 + Ubuntu 19.10 aarch64 + 4 Threads + yolov4_tiny_voc_416x416_integer_quant.tflite
Benchmark Timings (microseconds): count=50 first=233307 curr=233318 min=232446 max=364068 avg=243522 std=33354
TF models:
It would be interesting to compare TFLite with NCNN.
@AlexeyAB @dog-qiuqiu Hello! I am sorry to bother you. I want to ask: if I change the left one in the picture into the right one, is that a depthwise convolution? The answer is very important to me. Looking forward to your reply. Thanks a lot.
@Lowell-IC Brother, did you get your answer?
Mobile inference frameworks benchmark (4*ARM_CPU)
https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3