toplinuxsir opened this issue 4 years ago
Did you compile OpenCV with CUDA?
@KacperPaszkowski Yes, I compiled OpenCV with CUDA support.
Did you use
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
?
@KacperPaszkowski Yes, I set the backend to CUDA and the target to CUDA.
What GPU do you use?
Try different backends: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf#file-benchmark-cpp-L36-L45
Try this benchmark code to measure FPS: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf
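For a Python equivalent, a rough sketch of comparing backend/target combinations could look like the following (the yolov4.cfg/yolov4.weights file names and the 416x416 input size are assumptions for illustration, not taken from the linked gist):
import time
import cv2
import numpy as np

# Hypothetical file names; substitute your own cfg/weights.
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
out_names = net.getUnconnectedOutLayersNames()
blob = cv2.dnn.blobFromImage(np.zeros((416, 416, 3), dtype=np.uint8), 1/255, (416, 416), swapRB=True)

combinations = [
    ("OpenCV/CPU", cv2.dnn.DNN_BACKEND_OPENCV, cv2.dnn.DNN_TARGET_CPU),
    ("CUDA/FP32", cv2.dnn.DNN_BACKEND_CUDA, cv2.dnn.DNN_TARGET_CUDA),
    ("CUDA/FP16", cv2.dnn.DNN_BACKEND_CUDA, cv2.dnn.DNN_TARGET_CUDA_FP16),
]

for name, backend, target in combinations:
    net.setPreferableBackend(backend)
    net.setPreferableTarget(target)
    net.setInput(blob)
    net.forward(out_names)  # warm-up: the backend re-initializes after switching
    start = time.time()
    for _ in range(10):
        net.forward(out_names)
    print("%s: %.1f ms/forward" % (name, (time.time() - start) / 10 * 1000))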
I tried to use YOLOv4 with OpenCV on Windows 10. As AlexeyAB said, opencv-master should be compiled. I used CMake + VS2017 + CUDA 10.2 + cuDNN 7.6.5 but failed many times with this error: "fatal error LNK1181: cannot open input file '....\lib\Release\opencv_world440.lib'". How can I compile opencv-master for Windows 10 the right way? My GPU is a Tesla V100S.
@codingman2017 I compiled OpenCV 4.4-pre (master) for Ubuntu 20.04; I have no idea about compiling for Windows 10.
Corresponding issue by the same author at OpenCV: https://github.com/opencv/opencv/issues/17795
@codingman2017 Disable opencv_world.
If you only need DNN with CUDA support, the minimal set of modules required are:
cudev (from opencv_contrib)
opencv_core
opencv_dnn
opencv_imgproc
You might also require the following to read/write/display images and videos:
opencv_imgcodecs
opencv_highgui
opencv_videoio
You can disable the rest.
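Once built, a quick sanity check from Python that the CUDA backend actually made it into the build could look like this (a sketch using standard OpenCV calls, not specific to any particular build configuration):
import cv2

# Print only the CUDA/cuDNN lines from the build summary.
for line in cv2.getBuildInformation().splitlines():
    if "CUDA" in line or "cuDNN" in line:
        print(line.strip())

# 0 here means OpenCV cannot see a CUDA device (or the CUDA modules are missing from the build).
print("CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())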
I found the reason why it is slow:
net.forward()  # is very fast, about 88 FPS
But the threshold filtering below takes a long time, about 44 ms:
if lastLayer.type == 'DetectionOutput':
    # Network produces output blob with a shape 1x1xNx7 where N is the number of
    # detections and every detection is a vector of values
    # [batchId, classId, confidence, left, top, right, bottom]
    for out in outs:
        for detection in out[0, 0]:
            confidence = detection[2]
            if confidence > confThreshold:
                left = int(detection[3])
                top = int(detection[4])
                right = int(detection[5])
                bottom = int(detection[6])
                width = right - left + 1
                height = bottom - top + 1
                if width <= 2 or height <= 2:
                    left = int(detection[3] * frameWidth)
                    top = int(detection[4] * frameHeight)
                    right = int(detection[5] * frameWidth)
                    bottom = int(detection[6] * frameHeight)
                    width = right - left + 1
                    height = bottom - top + 1
                classIds.append(int(detection[1]) - 1)  # Skip background label
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])
elif lastLayer.type == 'Region':
    # Network produces output blob with a shape NxC where N is the number of
    # detected objects and C is the number of classes + 4, where the first 4
    # numbers are [center_x, center_y, width, height]
    for out in outs:
        for detection in out:
            scores = detection[5:]
            classId = np.argmax(scores)
            confidence = scores[classId]
            if confidence > confThreshold:
                center_x = int(detection[0] * frameWidth)
                center_y = int(detection[1] * frameHeight)
                width = int(detection[2] * frameWidth)
                height = int(detection[3] * frameHeight)
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                classIds.append(classId)
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])
else:
    print('Unknown output layer type: ' + lastLayer.type)
    exit()

# NMS is applied inside the Region layer only for DNN_BACKEND_OPENCV; for other backends we need NMS here in the sample.
# NMS is also required if the number of outputs > 1.
if len(outNames) > 1 or lastLayer.type == 'Region' and args.backend != cv.dnn.DNN_BACKEND_OPENCV:
    indices = []
    classIds = np.array(classIds)
    boxes = np.array(boxes)
    confidences = np.array(confidences)
    unique_classes = set(classIds)
    for cl in unique_classes:
        class_indices = np.where(classIds == cl)[0]
        conf = confidences[class_indices]
        box = boxes[class_indices].tolist()
        nms_indices = cv.dnn.NMSBoxes(box, conf, confThreshold, nmsThreshold)
        nms_indices = nms_indices[:, 0] if len(nms_indices) else []
        indices.extend(class_indices[nms_indices])
else:
    indices = np.arange(0, len(classIds))

for i in indices:
    box = boxes[i]
    left = box[0]
    top = box[1]
    width = box[2]
    height = box[3]
    drawPred(classIds[i], confidences[i], left, top, left + width, top + height)
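Most of that 44 ms is the per-detection Python loop above. As a rough sketch (not part of the original sample), the 'Region' branch can be vectorized with NumPy; it assumes the same outs, frameWidth, frameHeight, confThreshold, classIds, confidences and boxes variables as the code above:
import numpy as np

for out in outs:
    scores = out[:, 5:]
    class_ids = np.argmax(scores, axis=1)
    class_confidences = scores[np.arange(len(scores)), class_ids]
    keep = class_confidences > confThreshold  # boolean mask instead of a Python-level check per detection

    kept = out[keep]
    w = (kept[:, 2] * frameWidth).astype(int)
    h = (kept[:, 3] * frameHeight).astype(int)
    left = (kept[:, 0] * frameWidth - w / 2).astype(int)
    top = (kept[:, 1] * frameHeight - h / 2).astype(int)

    classIds.extend(class_ids[keep].tolist())
    confidences.extend(class_confidences[keep].astype(float).tolist())
    boxes.extend(np.stack([left, top, w, h], axis=1).tolist())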
@toplinuxsir Use DetectionModel and your pre/post-processing will become very fast.
@YashasSamaga Any Python example ? Thanks !
@toplinuxsir Note that the first model.detect call will be slow. Subsequent predictions will be fast.
import cv2
CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4
image = cv2.imread("dog.jpg")
model = cv2.dnn_DetectionModel("yolov4.weights", "yolov4.cfg")
model.setInputParams(size=(416, 416), scale=1/256)
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
Output:
[1] [0.9838313] [128 127 440 294]
[7] [0.85620713] [465 76 224 95]
[16] [0.9921662] [133 234 177 303]
/cc @KacperPaszkowski
Sorry, I forgot to set the backend.
image = cv2.imread("dog.jpg")
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
@toplinuxsir @KacperPaszkowski
@YashasSamaga Thanks, it works!
With postprocessing, the OpenCV FP16 performance is the same as the Darknet library (master).
@toplinuxsir OpenCV should be much faster than Darknet. Can you try the following standalone program? It will report you the FPS (preprocessing + inference + postprocessing) and time spent drawing (not counted in FPS).
import cv2
import time
CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4
COLORS = [(0, 255, 255), (255, 255, 0), (0, 255, 0), (255, 0, 0)]
class_names = []
with open("classes.txt", "r") as f:
class_names = [cname.strip() for cname in f.readlines()]
vc = cv2.VideoCapture("demo.mp4")
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)
while cv2.waitKey(1) < 1:
    (grabbed, frame) = vc.read()
    if not grabbed:
        exit()
    start = time.time()
    classes, scores, boxes = model.detect(frame, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
    end = time.time()
    start_drawing = time.time()
    for (classid, score, box) in zip(classes, scores, boxes):
        color = COLORS[int(classid) % len(COLORS)]
        label = "%s : %f" % (class_names[classid[0]], score)
        cv2.rectangle(frame, box, color, 2)
        cv2.putText(frame, label, (box[0], box[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
    end_drawing = time.time()
    fps_label = "FPS: %.2f (excluding drawing time of %.2fms)" % (1 / (end - start), (end_drawing - start_drawing) * 1000)
    cv2.putText(frame, fps_label, (0, 25), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
    cv2.imshow("detections", frame)
I ran your code but I always get this error; what is the problem?
<class 'cv2.dnn_DetectionModel'> returned a result with an error set
@AlexeyAB @YashasSamaga
And when I run the video benchmark code above I get this error: error: (-212:Parsing error) Unsupported activation: mish in function 'ReadDarknetFromCfgStream'. Can you please help me? @AlexeyAB @YashasSamaga
@imohamadhoseins You need the latest master for mish activation.
Can you please explain more?
@imohamadhoseins Mish activation support was added to OpenCV after the last release (4.3.0). It's part of the next release. You have to clone OpenCV and build master branch to use YOLOv4.
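A quick way to check which build is actually being imported (a sketch; the file names are the ones used earlier in this thread):
import cv2

print("OpenCV version:", cv2.__version__)

# On 4.3.0 and older, parsing yolov4.cfg raises "Unsupported activation: mish".
try:
    cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    print("yolov4.cfg parsed OK - mish is supported")
except cv2.error as e:
    print("Parsing failed (build is probably too old):", e)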
Thank you so much!
@YashasSamaga Thank you very much. Yes, I tested your code; it is much faster than Darknet, excluding the drawing time.
Hi, I'm using OpenCV 4.4 with CUDA 11, cuDNN 8, and an RTX 2080 Ti, but I only get 13 FPS with DNN_TARGET_CUDA_FP16 and 28 FPS with DNN_TARGET_CUDA. Where is the problem?
@zpmmehrdad Please share the code you used.
It could also be due to cuDNN 8. It's also weird that you get a lower FPS with the FP16 target than with the FP32 target. cuDNN 8.0.2 is for some reason slower on a GTX 1050 in both OpenCV (~1.3x slower) and Darknet (~2x slower) compared to cuDNN 7.6.5.
How are you measuring the time? Does it include video capture, drawing, etc? Note that the first forward pass will be slow due to initialization.
Can you try running this and report what FPS you get?
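If it helps, here is a rough sketch that times each stage of the loop separately (capture vs. model.detect), skipping the first frame so initialization is not counted; it assumes the same demo.mp4 and yolov4 files used in the benchmark above:
import time
import cv2

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

vc = cv2.VideoCapture("demo.mp4")
frames = 0
capture_ms = 0.0
detect_ms = 0.0
while True:
    t0 = time.time()
    grabbed, frame = vc.read()
    if not grabbed:
        break
    t1 = time.time()
    model.detect(frame, 0.2, 0.4)
    t2 = time.time()
    if frames > 0:  # skip frame 0: it includes backend initialization
        capture_ms += (t1 - t0) * 1000
        detect_ms += (t2 - t1) * 1000
    frames += 1

if frames > 1:
    avg_detect = detect_ms / (frames - 1)
    print("capture: %.2f ms/frame" % (capture_ms / (frames - 1)))
    print("detect: %.2f ms/frame (%.1f FPS)" % (avg_detect, 1000.0 / avg_detect))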
I have experienced the same performance problem: OpenCV 4.4 + CUDA 11.0 + cuDNN 8.0 gives only 10 FPS (FP16), but OpenCV 4.4 + CUDA 10.2 + cuDNN 7.6.5 gives 32 FPS (FP16).
@toplinuxsir What GPU are you using?
@YashasSamaga RTX 2080Ti
I tested the benchmark code above and got FPS: 5.29 (excluding drawing time of 8.28 ms). Why?
@yancccc What device, version of CUDA Toolkit and cuDNN are you using?
ycc@ycc:~/opencv$ python yolov4.py
Detection time for one image: 1.461 s
[1] [0.98476726] [128 128 440 294]
[7] [0.858897] [466 76 224 95]
[16] [0.99192023] [133 234 177 303]
import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4

image = cv2.imread("dog.jpg")

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

s = time.time()
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
print("Detection time for one image: %.3f s" % (time.time() - s))

num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
Intel® Core™ i7-9700K CPU, GeForce GTX 1080
NVIDIA CUDA: YES (ver 10.0, CUFFT CUBLAS FAST_MATH)
NVIDIA GPU arch: 60 61
NVIDIA PTX archs:
cuDNN: YES (ver 7.6.4)
Using the GPU is slower than using the CPU. Why?
import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4

image = cv2.imread("dog.jpg")

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

s = time.time()
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
print("Detection time for one image: %.3f s" % (time.time() - s))

num_detections = len(boxes)
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)
ycc@ycc:~/opencv$ python yolov4.py
Detection time for one image: 0.416 s
[1] [0.98476726] [128 128 440 294]
[7] [0.85889685] [466 76 224 95]
[16] [0.99192023] [133 234 177 303]
@yancccc
1. The first forward pass is very slow, as it also does initialization. Your latest reply is measuring inference + initialization time.
2. The GTX 1080 doesn't give good FP16 performance. Please change `DNN_TARGET_CUDA_FP16` to `DNN_TARGET_CUDA` and run `yolov4.py` again.
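Applied to the yolov4.py script above, the two suggestions would look roughly like this (a sketch, not a drop-in replacement):
import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4

image = cv2.imread("dog.jpg")

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)  # FP32 target: the GTX 1080 gives poor FP16 performance

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)  # warm-up: the first call pays the initialization cost

s = time.time()
classes, scores, boxes = model.detect(image, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
print("Detection time for one image: %.3f s" % (time.time() - s))
for (classid, score, box) in zip(classes, scores, boxes):
    print(classid, score, box)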
Thanks a lot.
I use OpenCV 4 to load the YOLOv4 model and detection runs at 10 FPS, but if I use Darknet with the same model I get 30 FPS. Any suggestions? Thanks.