toplinuxsir opened this issue 4 years ago
Did you compile OpenCV with CUDA?
@KacperPaszkowski Yes, I compiled OpenCV with CUDA support.
Did you use
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
?
@KacperPaszkowski Yes, I set the backend to CUDA and the target to CUDA.
What GPU do you use?
Try different backends: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf#file-benchmark-cpp-L36-L45
Try this benchmark code to measure FPS: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf
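For a Python equivalent, a rough sketch of comparing backend/target combinations could look like the following (the yolov4.cfg/yolov4.weights file names and the 416x416 input size are assumptions for illustration, not taken from the linked gist):
import time
import cv2
import numpy as np

# Hypothetical file names; substitute your own cfg/weights.
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
out_names = net.getUnconnectedOutLayersNames()
blob = cv2.dnn.blobFromImage(np.zeros((416, 416, 3), dtype=np.uint8), 1/255, (416, 416), swapRB=True)

combinations = [
    ("OpenCV/CPU", cv2.dnn.DNN_BACKEND_OPENCV, cv2.dnn.DNN_TARGET_CPU),
    ("CUDA/FP32", cv2.dnn.DNN_BACKEND_CUDA, cv2.dnn.DNN_TARGET_CUDA),
    ("CUDA/FP16", cv2.dnn.DNN_BACKEND_CUDA, cv2.dnn.DNN_TARGET_CUDA_FP16),
]

for name, backend, target in combinations:
    net.setPreferableBackend(backend)
    net.setPreferableTarget(target)
    net.setInput(blob)
    net.forward(out_names)  # warm-up: the backend re-initializes after switching
    start = time.time()
    for _ in range(10):
        net.forward(out_names)
    print("%s: %.1f ms/forward" % (name, (time.time() - start) / 10 * 1000))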
I tried to use YOLOv4 with OpenCV on Windows 10. As AlexeyAB said, opencv-master should be compiled. I used CMake + VS2017 + CUDA 10.2 + cuDNN 7.6.5 but failed many times with this error: "fatal error LNK1181: cannot open input file '....\lib\Release\opencv_world440.lib'". How can I compile opencv-master for Windows 10 the right way? My GPU is a Tesla V100S.
@codingman2017 I compiled OpenCV 4.4-pre (master) for Ubuntu 20.04; I have no idea about compiling for Windows 10.
Corresponding issue by the same author at OpenCV: https://github.com/opencv/opencv/issues/17795
@codingman2017 Disable opencv_world.
If you only need DNN with CUDA support, the minimal set of modules required are:
cudev (from opencv_contrib)
opencv_core
opencv_dnn
opencv_imgproc
You might also require the following to read/write/display images and videos:
opencv_imgcodecs
opencv_highgui
opencv_videoio
You can disable the rest.
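Once built, a quick sanity check from Python that the CUDA backend actually made it into the build could look like this (a sketch using standard OpenCV calls, not specific to any particular build configuration):
import cv2

# Print only the CUDA/cuDNN lines from the build summary.
for line in cv2.getBuildInformation().splitlines():
    if "CUDA" in line or "cuDNN" in line:
        print(line.strip())

# 0 here means OpenCV cannot see a CUDA device (or the CUDA modules are missing from the build).
print("CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())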
I found the reason why it is slow:
net.forward()  # is very fast, about 88 FPS
But the threshold filtering below takes a long time, about 44 ms:
if lastLayer.type == 'DetectionOutput':
    # Network produces output blob with a shape 1x1xNx7 where N is the number of
    # detections and every detection is a vector of values
    # [batchId, classId, confidence, left, top, right, bottom]
    for out in outs:
        for detection in out[0, 0]:
            confidence = detection[2]
            if confidence > confThreshold:
                left = int(detection[3])
                top = int(detection[4])
                right = int(detection[5])
                bottom = int(detection[6])
                width = right - left + 1
                height = bottom - top + 1
                if width <= 2 or height <= 2:
                    left = int(detection[3] * frameWidth)
                    top = int(detection[4] * frameHeight)
                    right = int(detection[5] * frameWidth)
                    bottom = int(detection[6] * frameHeight)
                    width = right - left + 1
                    height = bottom - top + 1
                classIds.append(int(detection[1]) - 1)  # Skip background label
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])
elif lastLayer.type == 'Region':
    # Network produces output blob with a shape NxC where N is the number of
    # detected objects and C is the number of classes + 4, where the first 4
    # numbers are [center_x, center_y, width, height]
    for out in outs:
        for detection in out:
            scores = detection[5:]
            classId = np.argmax(scores)
            confidence = scores[classId]
            if confidence > confThreshold:
                center_x = int(detection[0] * frameWidth)
                center_y = int(detection[1] * frameHeight)
                width = int(detection[2] * frameWidth)
                height = int(detection[3] * frameHeight)
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                classIds.append(classId)
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])
else:
    print('Unknown output layer type: ' + lastLayer.type)
    exit()

# NMS is applied inside the Region layer only for DNN_BACKEND_OPENCV; for other backends we need NMS here in the sample.
# NMS is also required if the number of outputs > 1.
if len(outNames) > 1 or lastLayer.type == 'Region' and args.backend != cv.dnn.DNN_BACKEND_OPENCV:
    indices = []
    classIds = np.array(classIds)
    boxes = np.array(boxes)
    confidences = np.array(confidences)
    unique_classes = set(classIds)
    for cl in unique_classes:
        class_indices = np.where(classIds == cl)[0]
        conf = confidences[class_indices]
        box = boxes[class_indices].tolist()
        nms_indices = cv.dnn.NMSBoxes(box, conf, confThreshold, nmsThreshold)
        nms_indices = nms_indices[:, 0] if len(nms_indices) else []
        indices.extend(class_indices[nms_indices])
else:
    indices = np.arange(0, len(classIds))

for i in indices:
    box = boxes[i]
    left = box[0]
    top = box[1]
    width = box[2]
    height = box[3]
    drawPred(classIds[i], confidences[i], left, top, left + width, top + height)
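Most of that 44 ms is the per-detection Python loop above. As a rough sketch (not part of the original sample), the 'Region' branch can be vectorized with NumPy; it assumes the same outs, frameWidth, frameHeight, confThreshold, classIds, confidences and boxes variables as the code above:
import numpy as np

for out in outs:
    scores = out[:, 5:]
    class_ids = np.argmax(scores, axis=1)
    class_confidences = scores[np.arange(len(scores)), class_ids]
    keep = class_confidences > confThreshold  # boolean mask instead of a Python-level check per detection

    kept = out[keep]
    w = (kept[:, 2] * frameWidth).astype(int)
    h = (kept[:, 3] * frameHeight).astype(int)
    left = (kept[:, 0] * frameWidth - w / 2).astype(int)
    top = (kept[:, 1] * frameHeight - h / 2).astype(int)

    classIds.extend(class_ids[keep].tolist())
    confidences.extend(class_confidences[keep].astype(float).tolist())
    boxes.extend(np.stack([left, top, w, h], axis=1).tolist())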
@toplinuxsir Use DetectionModel and your pre/post-processing will become very fast.
@YashasSamaga Any Python example ? Thanks !
@toplinuxsir Note that the first model.detect call will be slow. Subsequent predictions will be fast.
import cv2
CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4
image = cv2.imread("dog.jpg")
model = cv2.dnn_DetectionModel("yolov4.weights", "yolov4.cfg")
model.setInputParams(size=(416, 416), scale=1/256)
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
Output:
[1] [0.9838313] [128 127 440 294]
[7] [0.85620713] [465 76 224 95]
[16] [0.9921662] [133 234 177 303]
/cc @KacperPaszkowski
Sorry, I forgot to set the backend.
image = cv2.imread("dog.jpg")
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
@toplinuxsir @KacperPaszkowski
@YashasSamaga Thanks, it works!
With postprocessing, the OpenCV FP16 performance is the same as the Darknet library (master).
@toplinuxsir OpenCV should be much faster than Darknet. Can you try the following standalone program? It will report you the FPS (preprocessing + inference + postprocessing) and time spent drawing (not counted in FPS).
import cv2
import time
CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4
COLORS = [(0, 255, 255), (255, 255, 0), (0, 255, 0), (255, 0, 0)]
class_names = []
with open("classes.txt", "r") as f:
class_names = [cname.strip() for cname in f.readlines()]
vc = cv2.VideoCapture("demo.mp4")
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)
while cv2.waitKey(1) < 1:
    (grabbed, frame) = vc.read()
    if not grabbed:
        exit()
    start = time.time()
    classes, scores, boxes = model.detect(frame, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
    end = time.time()
    start_drawing = time.time()
    for (classid, score, box) in zip(classes, scores, boxes):
        color = COLORS[int(classid) % len(COLORS)]
        label = "%s : %f" % (class_names[classid[0]], score)
        cv2.rectangle(frame, box, color, 2)
        cv2.putText(frame, label, (box[0], box[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
    end_drawing = time.time()
    fps_label = "FPS: %.2f (excluding drawing time of %.2fms)" % (1 / (end - start), (end_drawing - start_drawing) * 1000)
    cv2.putText(frame, fps_label, (0, 25), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
    cv2.imshow("detections", frame)
I ran your code but I always get this error; what is the problem?
<class 'cv2.dnn_DetectionModel'> returned a result with an error set
@AlexeyAB @YashasSamaga
And when I run the video benchmark code above I get this error: error: (-212:Parsing error) Unsupported activation: mish in function 'ReadDarknetFromCfgStream'. Can you please help me? @AlexeyAB @YashasSamaga
@imohamadhoseins You need the latest master for mish activation.
Can you please explain more?
@imohamadhoseins Mish activation support was added to OpenCV after the last release (4.3.0). It's part of the next release. You have to clone OpenCV and build master branch to use YOLOv4.
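A quick way to check which build is actually being imported (a sketch; the file names are the ones used earlier in this thread):
import cv2

print("OpenCV version:", cv2.__version__)

# On 4.3.0 and older, parsing yolov4.cfg raises "Unsupported activation: mish".
try:
    cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    print("yolov4.cfg parsed OK - mish is supported")
except cv2.error as e:
    print("Parsing failed (build is probably too old):", e)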
Thank you so much!
@YashasSamaga Thank you very much. Yes, I tested your code; it is much faster than Darknet, excluding the drawing time.
Hi, I'm using OpenCV 4.4 with CUDA 11, cuDNN 8, and an RTX 2080 Ti, but I only get 13 FPS with DNN_TARGET_CUDA_FP16 and 28 FPS with DNN_TARGET_CUDA. Where is the problem?
@zpmmehrdad Please share the code you used.
It could also be due to cuDNN 8. It's also weird that you get a lower FPS with the FP16 target than with the FP32 target. cuDNN 8.0.2 is for some reason slower on a GTX 1050 in both OpenCV (~1.3x slower) and Darknet (~2x slower) compared to cuDNN 7.6.5.
How are you measuring the time? Does it include video capture, drawing, etc? Note that the first forward pass will be slow due to initialization.
Can you try running this and report what FPS you get?
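If it helps, here is a rough sketch that times each stage of the loop separately (capture vs. model.detect), skipping the first frame so initialization is not counted; it assumes the same demo.mp4 and yolov4 files used in the benchmark above:
import time
import cv2

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

vc = cv2.VideoCapture("demo.mp4")
frames = 0
capture_ms = 0.0
detect_ms = 0.0
while True:
    t0 = time.time()
    grabbed, frame = vc.read()
    if not grabbed:
        break
    t1 = time.time()
    model.detect(frame, 0.2, 0.4)
    t2 = time.time()
    if frames > 0:  # skip frame 0: it includes backend initialization
        capture_ms += (t1 - t0) * 1000
        detect_ms += (t2 - t1) * 1000
    frames += 1

if frames > 1:
    avg_detect = detect_ms / (frames - 1)
    print("capture: %.2f ms/frame" % (capture_ms / (frames - 1)))
    print("detect: %.2f ms/frame (%.1f FPS)" % (avg_detect, 1000.0 / avg_detect))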
I have experienced the same performance problem: OpenCV 4.4 + CUDA 11.0 + cuDNN 8.0 gives only 10 FPS (FP16), but OpenCV 4.4 + CUDA 10.2 + cuDNN 7.6.5 gives 32 FPS (FP16).
@toplinuxsir What GPU are you using?
@YashasSamaga RTX 2080Ti
I tested the benchmark code above and got FPS: 5.29 (excluding drawing time of 8.28 ms). Why?
@yancccc What device, version of CUDA Toolkit and cuDNN are you using?
ycc@ycc:~/opencv$ python yolov4.py
Detection time for one image: 1.461 s
[1] [0.98476726] [128 128 440 294]
[7] [0.858897] [466 76 224 95]
[16] [0.99192023] [133 234 177 303]
import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4

image = cv2.imread("dog.jpg")

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

s = time.time()
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
print("Detection time for one image: %.3f s" % (time.time() - s))

num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
Intel® Core™ i7-9700K CPU, GeForce GTX 1080
NVIDIA CUDA: YES (ver 10.0, CUFFT CUBLAS FAST_MATH)
NVIDIA GPU arch: 60 61
NVIDIA PTX archs:
cuDNN: YES (ver 7.6.4)
Using the GPU is slower than using the CPU. Why?
import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4

image = cv2.imread("dog.jpg")

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

s = time.time()
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
print("Detection time for one image: %.3f s" % (time.time() - s))

num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
ycc@ycc:~/opencv$ python yolov4.py
Detection time for one image: 0.416 s
[1] [0.98476726] [128 128 440 294]
[7] [0.85889685] [466 76 224 95]
[16] [0.99192023] [133 234 177 303]
@yancccc
1. The first forward pass is very slow, as it also does initialization. Your latest reply is measuring inference + initialization time.
2. The GTX 1080 doesn't give good FP16 performance. Please change `DNN_TARGET_CUDA_FP16` to `DNN_TARGET_CUDA` and run `yolov4.py` again.
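Applied to the yolov4.py script above, the two suggestions would look roughly like this (a sketch, not a drop-in replacement):
import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4

image = cv2.imread("dog.jpg")

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)  # FP32 target: the GTX 1080 gives poor FP16 performance

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)  # warm-up: the first call pays the initialization cost

s = time.time()
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
print("Detection time for one image: %.3f s" % (time.time() - s))
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)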
Thanks a lot.
I use OpenCV 4 to load the YOLOv4 model and detection runs at 10 FPS, but if I use Darknet with the same model I get 30 FPS. Any suggestions? Thanks.