AlexeyAB opened this issue 4 years ago
I know that Alex is telling the truth. Can you test it on my system? I want to see this.
You can install it from here: https://github.com/ceccocats/tkDNN and use the models described above with COCO images from the COCO dataset website.
To detect small objects you must also use 3 yolo layers in yolov4-tiny.
Hello, can you please share the cfg file for this?
Hi guys!
How does this compare to the yolo_v3_tiny_pan3_aa_ae_mixup_scale_giou.cfg model in speed/accuracy?
To detect small objects you must also use 3 yolo layers in yolov4-tiny.
I think a 3-yolo-layer yolov4-tiny would be a game changer for my application. If someone can make the change and retrain on the COCO dataset, I would appreciate it very much.
Comparison of accuracy of Yolov3-tiny (width=800 height=800) vs Yolov4-tiny (width=800 height=800)
source | yolov3-tiny (800x800) | yolov4-tiny (800x800) |
---|---|---|
Great work @AlexeyAB !
Comparison of accuracy of Yolov3-tiny (width=800 height=800) vs Yolov4-tiny (width=800 height=800)

source | yolov3-tiny (800x800) | yolov4-tiny (800x800) |
---|---|---|
Hi, can you share the dataset?
@wwzh2015 This is the default yolov4-tiny model, trained on the http://mscoco.org/ dataset (just set width=800 height=800 in the cfg file): https://github.com/AlexeyAB/darknet#pre-trained-models
Do you think OpenCV 4.4.0 can load this model? Right now it fails to build. Thank you.
This is amazing, but it fails to detect small objects. My use case only has small objects (about 10 - 30 pixels). Should I regenerate anchors or use a third yolo layer? If the latter, how can I change the cfg file to add a third layer? Thanks!
This is amazing, but it fails to detect small objects. My use case only has small objects (about 10 - 30 pixels). Should I regenerate anchors or use a third yolo layer? If the latter, how can I change the cfg file to add a third layer? Thanks!
I think using a third yolo layer is exactly right.
Comparison of accuracy of Yolov3-tiny (width=800 height=800) vs Yolov4-tiny (width=800 height=800)

source | yolov3-tiny (800x800) | yolov4-tiny (800x800) |
---|---|---|
@AlexeyAB World-class! In the coming years it will only improve.
371 FPS on GPU GTX 1080 Ti - Darknet framework
Can it reach 371 FPS on a GTX 1080 Ti at input resolution 416x416?
OpenCV uses cuDNN and does not support TensorRT. OpenCV will be slower on low-end devices, where convolutions become a bigger bottleneck and TensorRT-based solutions outperform cuDNN-based solutions. The CUDA backend in OpenCV is just six months old, so be sure to check again in the future.
Prefer tkDNN if your device is not a high-end GPU, and use tkDNN if you need INT8 precision. On a high-end device, if tiny performance gains matter a lot, try both OpenCV and tkDNN.
You can extract much higher FPS on low-end devices using a pipeline with detection, tracking, etc. You can find an outdated example here.
This is what you can expect in the upcoming OpenCV 4.4 release.

CUDA version: 10.2
cuDNN version: 7.6.5
Benchmark Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf
YOLOv4

Performance

Numbers in the tables indicate FPS.
Device: RTX 2080 Ti
Input Size | OCV CUDA FP32 (batch = 1) | OCV CUDA FP32 (batch = 4) | OCV CUDA FP16 (batch = 1) | OCV CUDA FP16 (batch = 4) |
---|---|---|---|---|
320 x 320 | 137 | 208 | 183 | 430 |
416 x 416 | 106 | 148 | 159 | 294 |
512 x 512 | 95 | 121 | 138 | 216 |
608 x 608 | 60 | 72 | 115 | 149 |
Device: GTX 1080 Ti
Input Size | OCV CUDA FP32 (batch = 1) | OCV CUDA FP32 (batch = 4) |
---|---|---|
320 x 320 | 87 | 177 |
416 x 416 | 75 | 116 |
512 x 512 | 64 | 87 |
608 x 608 | 48 | 58 |
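For context, here is a minimal Python sketch of an equivalent setup (the benchmark gist above is C++). It assumes an OpenCV >= 4.4 build with CUDA enabled; the file paths are placeholders. Note the tables above exclude NMS/pre/post-processing, while DetectionModel includes them.

import cv2

# Load the Darknet model and select the CUDA backend/target.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)  # DNN_TARGET_CUDA for FP32

# DetectionModel wraps blob creation, the forward pass, box decoding and NMS.
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("dog.jpg")
classes, confidences, boxes = model.detect(frame, confThreshold=0.25, nmsThreshold=0.4)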
Accuracy

Calculated using the dataset and list from How to evaluate accuracy and speed of YOLOv4
Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98
Note: `thresh=0.001` was added to all `[yolo]` blocks in yolov4.cfg
Device: RTX 2080 Ti

Darknet FP32 | OCV CUDA FP32 | OCV CUDA FP16 |
---|---|---|
YOLOv4 Tiny

Numbers in the tables indicate FPS.

Device: RTX 2080 Ti
Input Size | OCV CUDA FP32 (batch = 1) | OCV CUDA FP32 (batch = 4) | OCV CUDA FP16 (batch = 1) | OCV CUDA FP16 (batch = 4) |
---|---|---|---|---|
416 x 416 | 754 | 973 | 773 | 1353 |
Device: GTX 1080 Ti
Input Size | OCV CUDA FP32 (batch = 1) | OCV CUDA FP32 (batch = 4) |
---|---|---|
416 x 416 | 557 | 792 |
GPU-CPU data transfer and comparing with tkDNN

Timings reported by tkDNN do not include the time spent transferring data between the CPU and GPU. If I have understood correctly, tkDNN lets you manage the transfer process manually, so you can overlap data transfer with inference and hide the transfer costs completely. Therefore, you can achieve the performance that tkDNN reports even if you have/need data on the CPU.

Timings reported in this post include the transfer time, which makes up a significant chunk of the inference time. OpenCV doesn't let you control the data transfer process, nor does it accept `cv::cuda::GpuMat` as input. Hence, you won't be able to easily hide the data transfer time. You can mitigate it partially or fully by using multiple `cv::dnn::Net` objects. Using multiple `cv::dnn::Net` objects also has the additional benefit of keeping the GPU always busy (i.e. you reduce GPU idle time between two inference workloads).
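A rough Python illustration of the multiple-Net idea (a sketch, not the benchmark code): each worker thread owns its own Net, so one net's CPU<->GPU transfers can overlap another's inference. File paths are placeholders.

import queue
import threading

import cv2

def worker(frames, results):
    # Each thread owns its own Net; a single Net instance is not shared across threads.
    net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
    out_names = net.getUnconnectedOutLayersNames()
    while True:
        item = frames.get()
        if item is None:  # poison pill: stop this worker
            break
        idx, frame = item
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (608, 608), swapRB=True)
        net.setInput(blob)
        results.put((idx, net.forward(out_names)))

frames, results = queue.Queue(maxsize=8), queue.Queue()
workers = [threading.Thread(target=worker, args=(frames, results)) for _ in range(2)]
for w in workers:
    w.start()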
OpenCV partially mitigates the data transfer cost if your network has multiple outputs. The GPU-to-CPU transfer of an output begins immediately once that output becomes available. YOLOv4 608x608 has three output blobs, and they take 0.5ms, 0.3ms and 0.04ms for data transfer on an RTX 2080 Ti (the input transfer takes a further 0.36ms). OpenCV begins transferring the 0.5ms output while the GPU is still busy computing the 0.3ms and 0.04ms outputs. This way OpenCV completely hides the transfer cost of the 0.5ms and 0.3ms outputs. Overall, only the transfer of the input from CPU to GPU and of the last output from GPU to CPU is visible in the benchmarks (which together are still significant).

This mitigation strategy doesn't work well with YOLOv4 Tiny: OpenCV computes the smaller output blobs first and the largest output blob last, so the gains aren't as high as for YOLOv4. It's possible to hack around with the order of the layers in yolov4-tiny.cfg and trick the importer into scheduling the layers so that the largest output blob is computed first.
If you're interested, here are the extra data transfer costs that were incurred. To calculate the inference-only FPS, deduct this from the reported mean inference time and then compute the FPS.
GTX 1080 Ti excess for YOLOv4 608x608 (batch = 1): 0.4ms (0.36ms input + 0.04ms last output)
GTX 1080 Ti excess for YOLOv4-Tiny 416x416 (batch = 1): 0.22ms (0.17ms input + 0.055ms last output)
RTX 2080 Ti excess for YOLOv4 608x608 (batch = 1): 0.4ms (0.36ms input + 0.04ms last output)
RTX 2080 Ti excess for YOLOv4 608x608 (batch = 4): 1.65ms (1.52ms input + 0.12ms last output)
RTX 2080 Ti excess for YOLOv4-Tiny 416x416 (batch = 1): 0.23ms (0.17ms input + 0.05ms last output)
RTX 2080 Ti excess for YOLOv4-Tiny 416x416 (batch = 4): 0.93ms (0.72ms input + 0.2ms last output)
For example, for YOLOv4 Tiny (batch = 4) on RTX 2080 Ti: 2.957ms - 0.93ms = 2.027ms, which is 0.507ms per batch item (1972 FPS).
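The same correction as a tiny Python sketch, using the batch = 4 YOLOv4-Tiny numbers above:

# Transfer-corrected FPS for YOLOv4-Tiny 416x416, batch = 4, RTX 2080 Ti
mean_time_ms = 2.957       # reported mean time per batch (includes CPU<->GPU transfers)
transfer_excess_ms = 0.93  # 0.72 ms input + ~0.2 ms last output
batch = 4

infer_only_ms = mean_time_ms - transfer_excess_ms  # 2.027 ms per batch
per_image_ms = infer_only_ms / batch               # ~0.507 ms per image
fps = 1000.0 / per_image_ms                        # ~1972 FPS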
It's not very meaningful to do this procedure, though: on low-end devices the data transfer cost becomes negligible because computation takes most of the time.
Thank you for the great post. Any comments or benchmarks on running inference on the CPU with the IE backend? Thanks.
@YashasSamaga Great!

It seems that OpenCV is now faster than tkDNN-TensorRT for yolov4.cfg in most cases, even with the cost of data transfer!

Can you test what AVG_FPS you get for YOLOv4-tiny 416x416 on an RTX 2080 Ti with the -benchmark flag after waiting ~10 seconds?
@mmaaz60
7700HQ, GTX 1050 Mobile
CUDA 10.2, cuDNN 7.6.5 (CUDA timings in the detailed stats collapsible)
MKL 2020.1.217
OpenVINO 2020.3.194
OCV CPU (batch = 1) | OCV CPU (batch = 4) | OCV IE CPU (batch = 1) | OCV IE CPU (batch = 4) |
---|---|---|---|
28 | 26 | 42 | 39 |
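For reference, selecting the IE backend is a one-line change in OpenCV, assuming a build compiled with OpenVINO (a sketch; file paths are placeholders):

import cv2

net = cv2.dnn.readNetFromDarknet("yolov4-tiny.cfg", "yolov4-tiny.weights")
# Requires an OpenCV build compiled with OpenVINO; otherwise inference fails with an error.
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)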
@AlexeyAB
./darknet detector demo cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights test.mp4 -benchmark
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
gives 443 AVG FPS on RTX 2080 Ti after ~20s
(including: NMS, pre/post-processing, and transferring CPU->GPU and GPU->CPU)
NMS and pre/post-processing are not included in the timings I reported. CPU->GPU transfer, inference, and GPU->CPU transfer are included.
@AlexeyAB I think the tkDNN timings in the Darknet README are outdated. The new tkDNN timings beat OpenCV in most cases, but OpenCV seems to outperform tkDNN after the data transfer correction.
If someone already has their input on the GPU and requires outputs on the GPU, tkDNN will beat OpenCV. If the input is on the CPU and outputs are required on the CPU, OpenCV will likely beat tkDNN.
@YashasSamaga
Thanks, I fixed it.
Did somebody try to implement NMS and/or pre-processing (resizing / converting RGB->float) on the GPU for OpenCV / tkDNN-TRT? Which would be faster?

Why do you get lower FPS for batch=4 (39 FPS) than for batch=1 (42 FPS)? Or is it actually 39 x 4 = 156 FPS?
OCV CPU (batch = 1) | OCV CPU (batch = 4) | OCV IE CPU (batch = 1) | OCV IE CPU (batch = 4) |
---|---|---|---|
28 | 26 | 42 | 39 |
./darknet detector demo cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights test.mp4 -benchmark
GPU=0
CUDNN=0
CUDNN_HALF=0
OPENCV=1
OPENMP=1
AVX=1
7700HQ
608 x 608 input for YOLOv4
416 x 416 input for YOLOv4 Tiny
Numbers in the table indicate FPS.
Model | Darknet | OCV CPU (batch = 1) | OCV IE (batch = 1) | OCV CPU (batch = 4) | OCV IE (batch = 4) |
---|---|---|---|---|---|
YOLOv4 | 0.2 | 1.4 | 1.0 | 1.36 | 1.16 |
YOLOv4Tiny | 3.4 | 28 | 43 | 25.6 | 40.4 |
So batch = 4 is again slower. Either batch = 1 is anomalously fast, or the way I benchmark is not correct for CPU workloads.
@YashasSamaga Thanks!
YOLOv4Tiny | 3.4 FPS (Darknet)
So you get 2x lower FPS on a mobile Core i7 7700HQ (2.8/3.8 GHz turbo boost, 8 HT cores) - 3.4 FPS https://ark.intel.com/content/www/ru/ru/ark/products/97185/intel-core-i7-7700hq-processor-6m-cache-up-to-3-80-ghz.html
than I do on a desktop Core i7 6700K (4.0/4.2 GHz turbo boost, 8 HT cores) - 6.1 FPS https://ark.intel.com/content/www/ru/ru/ark/products/88195/intel-core-i7-6700k-processor-8m-cache-up-to-4-20-ghz.html
Did you close all other applications (web browser, ...) when testing with the -benchmark flag?
Did you close all other applications (web browser, ...) when testing with the -benchmark flag?
Yes. Darknet is using ~780% of the CPU (read it off the top command).
Benchmarking on CPUs is always confusing. I have seen FPS double by using a different BLAS library in my projects. I have also seen FPS triple moving from 7700HQ to 7700K!
Comparison of Yolo4-tiny, Yolo4-tiny3l, Yolo4-tiny3l-spp on a custom dataset:
https://www.youtube.com/watch?v=RtEogGr3aW8
https://www.youtube.com/watch?v=hpjh_SEXtm0
@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.
@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.
Thanks @Anafeyka,
What is the speed comparison of Yolov4-tiny vs Yolov4-tiny3l & Yolov4-tiny3l-spp?
@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.

Thanks @Anafeyka,
What is the speed comparison of Yolov4-tiny vs Yolov4-tiny3l & Yolov4-tiny3l-spp?
I can't check the speed on a lot of equipment right now, but I can give you all the weights: https://drive.google.com/file/d/1aSFz5X9OkK8ZeDJoeTc-NrXw4ojf53t6/view?usp=sharing
@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.
Could you please also share the yolov4-tiny-3l one?
@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.

Could you please also share the yolov4-tiny-3l one?
1. https://github.com/AlexeyAB/darknet/tree/master/cfg
2. All cfg + weights: https://drive.google.com/file/d/1aSFz5X9OkK8ZeDJoeTc-NrXw4ojf53t6/view?usp=sharing
OpenCV 4.4.0 is released; it supports YOLOv4 and YOLOv4-tiny: https://github.com/opencv/opencv/wiki/ChangeLog#version440
All OpenCV releases: https://opencv.org/releases/
Discussion: https://github.com/AlexeyAB/darknet/issues/6284
was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?
was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?
There is no boolean answer to this. It depends on the device. tkDNN is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.
@Anafeyka did you use the https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29 weights when you trained Yolo4-tiny3l and Yolo4-tiny3l-spp for faces? Great work!
@Anafeyka did you use the https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29 weights when you trained Yolo4-tiny3l and Yolo4-tiny3l-spp for faces? Great work!
No
darknet.exe detector train custom/hf_obj.data custom/yolov4-tiny_3l.cfg -map
was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?
There is no boolean answer to this. It depends on the device. tkDNN is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.
@LukeAI @YashasSamaga, I have been testing OpenCV vs. TKDNN on Jetson Xavier.
It seems like TKDNN is a bit faster than OpenCV... but its licensing causes minor problems if you want to use it for commercial purposes.
As Yashas mentioned, I do not know if TKDNN includes the preprocessing/NMS/postprocessing in their overall timing.
@Anafeyka did you use the https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29 weights when you trained Yolo4-tiny3l and Yolo4-tiny3l-spp for faces? Great work!
No
darknet.exe detector train custom/hf_obj.data custom/yolov4-tiny_3l.cfg -map
Thanks @Anafeyka
was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?
There is no boolean answer to this. It depends on the device. tkDNN is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.
@LukeAI @YashasSamaga, I have been testing OpenCV vs. TKDNN on Jetson Xavier.
It seems like TKDNN is a bit faster than OpenCV... but its licensing causes minor problems if you want to use it for commercial purposes.
you mean the GPL?
was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?
There is no boolean answer to this. It depends on the device. tkDNN is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.
@LukeAI @YashasSamaga, I have been testing OpenCV vs. TKDNN on Jetson Xavier. It seems like TKDNN is a bit faster than OpenCV... but its licensing causes minor problems if you want to use it for commercial purposes.
you mean the GPL?
Yes, I have read a few comments in different threads where people had to go the OpenCV route to avoid that license. Again, that's not my case, but I guess some industries/companies do not accept GPL in their products?
@Anafeyka Did you try to convert your model to tensorrt ?
@Anafeyka Did you try to convert your model to tensorrt ?
Nope!
@Anafeyka Do you know if it's possible to start training my own tiny-spp by starting from your trained weights instead of the tiny-yolov4 ones?
@xaerincl Bad idea. I used a specially corrupted set of training data with incorrect labels.
What is the Top-1 accuracy of the yolov4-tiny backbone on ImageNet, and how is its AP higher than Pelee-SSD on COCO?
@AlexeyAB Hi,
Can we use random=1 in yolov4-tiny? And should we use num_of_clusters = 6 for calc_anchors?
Thanks
@zpmmehrdad 1. Yes. 2. Yes.
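For reference, anchors can be recalculated with Darknet's calc_anchors command (a sketch; data/obj.data is a placeholder for your own .data file, and 6 clusters matches the 2 [yolo] layers x 3 anchors of yolov4-tiny):

./darknet detector calc_anchors data/obj.data -num_of_clusters 6 -width 416 -height 416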
@AlexeyAB thanks for v4_tiny. I want to ask about the model architecture of tiny v4. YOLOv4 has a backbone of CSPDarknet53, YOLOv3 has a backbone of Darknet53, and yolov3_tiny has a backbone of darknet53_tiny; similarly, what is the backbone of tiny v4? I am going to convert tiny YOLOv4 weights to Keras h5 format, so I want to clear this up first. Thank you so much.
- yolov3 - darknet53.cfg
- yolov3-tiny - darknet.cfg
- yolov4 - csdarknet53-omega.cfg
- yolov4-tiny - darkv4.cfg
Thanks @AlexeyAB, but you mentioned darkv4.cfg and it's not in the cfg folder of this repo. I know we have yolov4_tiny.cfg. I was just looking for the backbone code for tiny_yolov4, like in this repository, since I have to add the backbone code. For example, for darknet53_tiny the author writes the backbone in Python as follows:
def darknet53_tiny(input_data):
input_data = common.convolutional(input_data, (3, 3, 3, 16))
input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
input_data = common.convolutional(input_data, (3, 3, 16, 32))
input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
input_data = common.convolutional(input_data, (3, 3, 32, 64))
input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
input_data = common.convolutional(input_data, (3, 3, 64, 128))
input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
input_data = common.convolutional(input_data, (3, 3, 128, 256))
route_1 = input_data  # save this 26x26 feature map for the medium-scale detection head
input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
input_data = common.convolutional(input_data, (3, 3, 256, 512))
input_data = tf.keras.layers.MaxPool2D(2, 1, 'same')(input_data)
input_data = common.convolutional(input_data, (3, 3, 512, 1024))
return route_1, input_data
and the model definition like:
def YOLOv3_tiny(input_layer, NUM_CLASS):
route_1, conv = backbone.darknet53_tiny(input_layer)
conv = common.convolutional(conv, (1, 1, 1024, 256))
conv_lobj_branch = common.convolutional(conv, (3, 3, 256, 512))
conv_lbbox = common.convolutional(conv_lobj_branch, (1, 1, 512, 3 * (NUM_CLASS + 5)), activate=False, bn=False)
conv = common.convolutional(conv, (1, 1, 256, 128))
conv = common.upsample(conv)
conv = tf.concat([conv, route_1], axis=-1)
conv_mobj_branch = common.convolutional(conv, (3, 3, 128, 256))
conv_mbbox = common.convolutional(conv_mobj_branch, (1, 1, 256, 3 * (NUM_CLASS + 5)), activate=False, bn=False)
return [conv_mbbox, conv_lbbox]
I want to write the same code for tiny YOLOv4. Any help or suggestions?
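Not an official answer, but here is a rough sketch in the same style, transcribed by hand from yolov4-tiny.cfg (the backbone is a tiny CSP variant, often called CSPDarknet53-tiny). `common.convolutional` / `common.upsample` are the same helpers used above; the `downsample=True` flag (a stride-2 convolution) and the `route_group` helper (Darknet's [route] with groups=2) are assumptions that may need adapting to your codebase:

import tensorflow as tf
# `common` is the helper module from the snippets above (convolutional/upsample).

def route_group(input_data, groups, group_id):
    # Mimics Darknet's [route] with groups: split the channels and keep one group.
    return tf.split(input_data, num_or_size_splits=groups, axis=-1)[group_id]

def cspdarknet53_tiny(input_data):
    # Two stride-2 convs: 416x416x3 -> 104x104x64
    input_data = common.convolutional(input_data, (3, 3, 3, 32), downsample=True)
    input_data = common.convolutional(input_data, (3, 3, 32, 64), downsample=True)

    # Three CSP blocks: split the channels, process one half, re-concatenate, max-pool.
    for ch in (64, 128, 256):
        input_data = common.convolutional(input_data, (3, 3, ch, ch))
        route = input_data
        input_data = route_group(input_data, 2, 1)
        input_data = common.convolutional(input_data, (3, 3, ch // 2, ch // 2))
        route_a = input_data
        input_data = common.convolutional(input_data, (3, 3, ch // 2, ch // 2))
        input_data = tf.concat([input_data, route_a], axis=-1)
        input_data = common.convolutional(input_data, (1, 1, ch, ch))
        if ch == 256:
            route_1 = input_data  # 26x26x256, feeds the medium-scale head
        input_data = tf.concat([route, input_data], axis=-1)
        input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)

    input_data = common.convolutional(input_data, (3, 3, 512, 512))  # 13x13x512
    return route_1, input_data

def YOLOv4_tiny(input_layer, NUM_CLASS):
    route_1, conv = cspdarknet53_tiny(input_layer)
    conv = common.convolutional(conv, (1, 1, 512, 256))
    # Large-object head (13x13)
    conv_lobj_branch = common.convolutional(conv, (3, 3, 256, 512))
    conv_lbbox = common.convolutional(conv_lobj_branch, (1, 1, 512, 3 * (NUM_CLASS + 5)), activate=False, bn=False)
    # Medium-object head (26x26): upsample and concat with the saved CSP feature map
    conv = common.convolutional(conv, (1, 1, 256, 128))
    conv = common.upsample(conv)
    conv = tf.concat([conv, route_1], axis=-1)  # 128 + 256 = 384 channels
    conv_mobj_branch = common.convolutional(conv, (3, 3, 384, 256))
    conv_mbbox = common.convolutional(conv_mobj_branch, (1, 1, 256, 3 * (NUM_CLASS + 5)), activate=False, bn=False)
    return [conv_mbbox, conv_lbbox]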
@AlexeyAB I used the official yolov4-tiny.weights and yolov4-tiny.cfg with 416x416 input, and testing the dog.jpg image on an RTX 2080 Ti takes about 20ms, which is very slow. Why is there such a big difference from your reported test times? Please advise.
Discussion: https://www.reddit.com/r/MachineLearning/comments/hu7lyt/p_yolov4tiny_speed_1770_fps_tensorrtbatch4/
Full structure: structure of yolov4-tiny.cfg model
YOLOv4-tiny released: 40.2% AP50, 371 FPS (GTX 1080 Ti) / 330 FPS (RTX 2070)

1770 FPS - on GPU RTX 2080 Ti - (416x416, fp16, batch=4) tkDNN/TensorRT https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652269964
1353 FPS - on GPU RTX 2080 Ti - (416x416, fp16, batch=4) OpenCV 4.4.0 (including: transferring CPU->GPU and GPU->CPU) (excluding: nms, pre/post-processing) https://github.com/AlexeyAB/darknet/issues/6067#issuecomment-656604015
39 FPS - 25ms latency - on Jetson Nano - (416x416, fp16, batch=1) tkDNN/TensorRT https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652157334
290 FPS - 3.5ms latency - on Jetson AGX - (416x416, fp16, batch=1) tkDNN/TensorRT https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652157334
42 FPS - on CPU Core i7 7700HQ (4 Cores / 8 Logical Cores) - (416x416, fp16, batch=1) OpenCV 4.4.0 (compiled with OpenVINO backend) https://github.com/AlexeyAB/darknet/issues/6067#issuecomment-656693529
20 FPS - on CPU ARM Kirin 990 - Smartphone Huawei P40 - Tencent/NCNN library https://github.com/AlexeyAB/darknet/issues/6091#issuecomment-651502121 https://github.com/Tencent/ncnn
120 FPS - on nVidia Jetson AGX Xavier - MAX_N - Darknet framework
371 FPS - on GPU GTX 1080 Ti - Darknet framework