MobileNetV2-YOLOv3-Nano: Detection network designed by mobile terminal,0.5BFlops🔥🔥🔥HUAWEI P40 6ms& 3MB!!!

AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )

http://pjreddie.com/darknet/


MobileNetV2-YOLOv3-Nano: Detection network designed by mobile terminal,0.5BFlops🔥🔥🔥HUAWEI P40 6ms& 3MB!!! #6091

Open dog-qiuqiu opened 4 years ago

dog-qiuqiu commented 4 years ago

Mobile inference frameworks benchmark (4*ARM_CPU)

Network	VOC mAP(0.5)	COCO mAP(0.5)	Resolution	Inference time (NCNN/Kirin 990)	Inference time (MNN arm82/Kirin 990)	FLOPS	Weight size
MobileNetV2-YOLOv3-Lite	72.61	36.57	320	33 ms	18 ms	1.8BFlops	8.0MB
MobileNetV2-YOLOv3-Nano	65.27	30.13	320	13 ms	5 ms	0.5BFlops	3.0MB
MobileNetV2-YOLOv3-Fastest	33.19	&	320	8.2 ms	3.67 ms	0.13BFlops	0.4MB

https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3

AlexeyAB commented 4 years ago

@dog-qiuqiu Thanks! Can you test and compare MobileNetV2-YOLOv3-Lite vs yolov3-tiny.cfg vs yolov3-tiny-prn.cfg vs yolov4.cfg, since they are already supported by NCNN? It seems that tiny-prn is faster than tiny on GPU, while tiny is faster than tiny-prn on NPU.

Also test yolov4-tiny.cfg once it is implemented in NCNN: https://github.com/Tencent/ncnn/issues/1885

You can also try to optimize yolov4-tiny.cfg for mobile CPUs.

dog-qiuqiu commented 4 years ago

@AlexeyAB Hi, these are the NCNN benchmark results on Huawei's Kirin 990, using the 4 high-performance cores:

loop_count = 1
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 31.58  max = 31.58  avg = 31.58
yolov3-tiny-prn  min = 36.60  max = 36.60  avg = 36.60
yolov3-tiny  min = 51.36  max = 51.36  avg = 51.36
yolov4  min = 733.67  max = 733.67  avg = 733.67

NCNN does not seem to support yolov4-tiny.

AlexeyAB commented 4 years ago

Thanks!

> NCNN does not seem to support yolov4-tiny.

It was implemented 2 hours ago: https://github.com/Tencent/ncnn/commit/0bc45eed525531d8f0b6d1991bfd55f6e9428410

> loop_count = 1 num_threads = 4 powersave = 0 gpu_device = -1 cooling_down = 0

Did you try gpu_device = 0?

dog-qiuqiu commented 4 years ago

OK!

loop_count = 1
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 33.14  max = 33.14  avg = 33.14
yolov3-tiny-prn  min = 37.15  max = 37.15  avg = 37.15
yolov3-tiny  min = 58.39  max = 58.39  avg = 58.39
yolov4  min = 781.29  max = 781.29  avg = 781.29

As far as I know, the Mali GPU has no speed advantage over the ARM CPU, at least on my Kirin 990, though Qualcomm GPUs may do better. You could try MNN's arm82 backend; in theory it should be about twice as fast as NCNN without arm82.

AlexeyAB commented 4 years ago

Yes, it seems this GPU doesn't improve speed. Try yolov4-tiny.

dog-qiuqiu commented 4 years ago

YOLOV4-TINY:

loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 35.15  max = 35.65  avg = 35.43
yolov3-tiny-prn  min = 38.83  max = 39.16  avg = 38.96
yolov3-tiny  min = 52.38  max = 53.01  avg = 52.74
yolov4-tiny  min = 51.23  max = 51.64  avg = 51.42
yolov4  min = 779.41  max = 791.94  avg = 785.52

AlexeyAB commented 4 years ago

@dog-qiuqiu Thanks! So yolov4-tiny.cfg runs at about 20 FPS (51.42 ms average) with 40.2% AP50 on COCO on the Kirin 990 ARM CPU (Huawei P40).
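The frame rate quoted here follows from the average latency in the benchmark log above (FPS = 1000 / latency in ms). A minimal sketch of that arithmetic, using the avg values reported for the loop_count = 4 CPU run; the helper name `fps_from_ms` is hypothetical and not part of NCNN or Darknet:

```python
# Average NCNN latencies in milliseconds, copied from the
# loop_count = 4, gpu_device = -1 run on the Kirin 990.
avg_latency_ms = {
    "MobileNetV2-YOLOv3-Lite-coco": 35.43,
    "yolov3-tiny-prn": 38.96,
    "yolov3-tiny": 52.74,
    "yolov4-tiny": 51.42,
    "yolov4": 785.52,
}


def fps_from_ms(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Hypothetical helper applied to every network in the log.
fps = {name: fps_from_ms(ms) for name, ms in avg_latency_ms.items()}

for name, value in fps.items():
    print(f"{name:30s} {value:6.1f} FPS")
```

For yolov4-tiny this gives roughly 19.4 FPS, which rounds to the 20 FPS figure. These numbers cover network inference only, as measured by the benchmark, not a full camera-to-display pipeline.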

So you can try to improve yolov4-tiny in the same way as MobileNetV2-YOLOv3-Lite/Nano/Fastest. Or just add groups= to the [conv] layers, and maybe SE blocks.
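As a rough illustration of why groups= reduces cost: a k x k convolution with c_in input and c_out output channels has k * k * (c_in / groups) * c_out weights, so both weights and multiply-accumulates shrink by a factor of groups. The sketch below counts them for one layer. The layer sizes are illustrative assumptions, not values from yolov4-tiny.cfg, the function name `conv_cost` is hypothetical, and biases and batch normalization are ignored.

```python
def conv_cost(c_in: int, c_out: int, k: int, out_h: int, out_w: int,
              groups: int = 1) -> tuple[int, int]:
    """Return (weights, multiply-accumulates) for one k x k convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    # Each output channel only sees c_in / groups input channels.
    weights = k * k * (c_in // groups) * c_out
    # One multiply-accumulate per weight per output position.
    macs = weights * out_h * out_w
    return weights, macs


# Hypothetical layer: 3x3 conv, 128 -> 128 channels, 26x26 output map.
dense = conv_cost(128, 128, 3, 26, 26, groups=1)
grouped = conv_cost(128, 128, 3, 26, 26, groups=8)

# Depthwise separable convolution: a depthwise 3x3 (groups == c_in)
# followed by a pointwise 1x1 convolution that mixes channels.
depthwise = conv_cost(128, 128, 3, 26, 26, groups=128)
pointwise = conv_cost(128, 128, 1, 26, 26, groups=1)
separable = (depthwise[0] + pointwise[0], depthwise[1] + pointwise[1])

for name, (w, m) in [("dense", dense), ("groups=8", grouped),
                     ("depthwise separable", separable)]:
    print(f"{name:20s} weights={w:8d} MACs={m:10d}")
```

A lower operation count does not guarantee lower latency on every device; actual speed depends on how well the backend implements grouped convolution.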

AlexeyAB commented 4 years ago

https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3

> Darknet Group convolution is not well supported on some GPUs such as NVIDIA PASCAL!!! The MobileNetV2-YOLOv3-SPP inference time is 100ms at GTX1080ti, but RTX2080 inference time is 5ms!!!

I think such a big difference (100 ms vs 5 ms) is due to different cuDNN versions or something else (for example, one build compiled with CUDNN=1 and the other with CUDNN=0).

Also, about groups=: Tensor Cores on Volta/RTX are used only if the conv layer has no groups parameter (or groups=1). For groups>1 the regular CUDA cores (shaders) are used, at about the same speed: https://github.com/AlexeyAB/darknet/blob/320e6fd8d29f6f7825ef668f15f955f90131f782/src/convolutional_kernels.cu#L423-L424

Darknet/TF/PyTorch/... all use the same grouped convolution from the cuDNN library.

dog-qiuqiu commented 4 years ago

I will try to improve yolov4-tiny with depthwise separable convolutions. Thank you for your work!!!

AlexeyAB commented 4 years ago

@dog-qiuqiu Hi, did you try testing yolov4-tiny.cfg and MobileNetV2-YOLOv3-Lite-coco on a Raspberry Pi 3 / 4?

dog-qiuqiu commented 4 years ago

@AlexeyAB Okay, I have a Raspberry Pi 3B. I will run the inference-time benchmark on it.

AlexeyAB commented 4 years ago

@dog-qiuqiu

> I will try to improve yolov4-tiny with depthwise separable convolutions. Thank you for your work!!!

> Okay, I have a Raspberry Pi 3B. I will run the inference-time benchmark on it.

Hi, did you have any success with it?

dog-qiuqiu commented 4 years ago

@AlexeyAB Sorry, my Raspberry Pi 3 is missing an SD card. I plan to buy one on Saturday and then run the Raspberry Pi 3 benchmark. In the meantime, I can now run MobileNetV2-YOLOv3-Nano on Android in real time, and I plan to port yolov4-tiny to Android so that it runs in real time as well. This is the Android project: https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3#ncnn-android-sample

dog-qiuqiu commented 4 years ago

@AlexeyAB Hi, this is a real-time detection Android project based on ncnn's yolov4-tiny: https://github.com/dog-qiuqiu/Android_NCNN_yolov4-tiny

AlexeyAB commented 4 years ago

@dog-qiuqiu Nice!

AlexeyAB commented 4 years ago

It seems a Raspberry Pi 4 (4 threads) can process yolov4-tiny (int8, 416x416) at about 4 FPS using TFLite: https://github.com/PINTO0309/PINTO_model_zoo#3-tflite-model-benchmark

RaspberryPi4 + Ubuntu 19.10 aarch64 + 4 Threads + yolov4_tiny_voc_416x416_integer_quant.tflite
Benchmark Timings (microseconds): count=50 first=233307 curr=233318 min=232446 max=364068 avg=243522 std=33354

TF models:

It would be interesting to compare TFLite with NCNN.

Lowell-IC commented 4 years ago

[image: 1602672919(1)]

@AlexeyAB @dog-qiuqiu Hello! I am sorry to bother you. I would like to ask: if I change what is on the left of the picture into what is on the right, is it a depthwise convolution? The answer is very important to me. Looking forward to your reply. Thanks a lot.

LYH-depth commented 3 years ago

@Lowell-IC brother, did you get your answer?