AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Latency discrepancy with different CPU's (same GPU and other h/w, s/w configuration) #6384

Open kmsravindra opened 4 years ago

kmsravindra commented 4 years ago

Hi @AlexeyAB

I have a couple of observations that I wanted to share with you and get your opinion on:

  1. I observed that the end-to-end (capture to display) latency differs between machines with these CPUs:
    • Intel i7-8700
    • Intel i7-9700

Both machines use the same RTX 2080 Ti GPU, and everything else, including the config, is exactly the same. With a 60 fps input, the end-to-end latency on the i7-8700 is on average 60 ms higher than on the i7-9700.

FYI, the latency is measured end to end (from capture to display). I point the camera at a YouTube clock video and take a slow-motion snapshot that shows the YouTube clock video and the Darknet display side by side while Darknet runs inference on the clock video. The difference between the two displayed timestamps is the measured latency.

  2. Another observation is that the latency difference is more pronounced at the higher input rate of 60 fps. With a 30 fps input on both machines, the latency difference is small. However, the inference time is the same in both cases, understandably, because inference runs on the GPU and so does not depend on the input frame rate.

It would be helpful to hear your thoughts on how a difference in CPU (Intel i7-8700 versus Intel i7-9700) can have such a significant impact on latency.

Also, do you think more CPU compute is needed to process a higher number of frames per second, and that this pushes up the latency when the input fps is higher?
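To make this question concrete, here is a toy model in Python. It is only a sketch under assumptions that are not confirmed anywhere in this thread: a single CPU worker, a FIFO queue with no frame dropping, and a hypothetical 20 ms of CPU work per frame. It says nothing about how Darknet actually buffers frames. It only shows that once CPU time per frame exceeds the frame period, waiting time adds up.

```python
def queue_latency_ms(input_fps, cpu_ms_per_frame, n_frames):
    """Per-frame latency for one CPU worker and a FIFO queue with no frame dropping."""
    period = 1000.0 / input_fps      # time between incoming frames, in ms
    free_at = 0.0                    # time at which the worker becomes idle
    latencies = []
    for i in range(n_frames):
        arrival = i * period
        start = max(arrival, free_at)        # wait if the worker is still busy
        free_at = start + cpu_ms_per_frame
        latencies.append(free_at - arrival)  # waiting time plus processing time
    return latencies

# Hypothetical number: 20 ms of CPU work per frame on both input rates.
at_30 = queue_latency_ms(30, 20.0, 10)   # period 33.3 ms > 20 ms: no backlog
at_60 = queue_latency_ms(60, 20.0, 10)   # period 16.7 ms < 20 ms: backlog grows
print(f"30 fps, last frame: {at_30[-1]:.1f} ms")
print(f"60 fps, last frame: {at_60[-1]:.1f} ms")
```

In a real pipeline a bounded buffer or frame dropping would cap this growth, so the model only illustrates why a slower CPU could matter more at 60 fps than at 30 fps.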

Are there any image buffers or pre-/post-processing steps that depend heavily on CPU compute and could have caused this variation in latency? Could it come from OpenCV image handling during capture, preprocessing, or postprocessing?
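One way to narrow this down would be to time each CPU-side stage separately on both machines. The sketch below is not Darknet code: the stage functions (`fake_resize`, `fake_detect`, `fake_draw`) are hypothetical placeholders standing in for the real capture, preprocessing, and postprocessing calls, and only the timing pattern is the point.

```python
import time
from collections import defaultdict

def time_stages(stages, frames):
    """Run each (name, fn) stage on every frame and collect per-stage times in ms."""
    samples = defaultdict(list)
    for frame in frames:
        data = frame
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)
            samples[name].append((time.perf_counter() - t0) * 1000.0)
    return samples

def summarize(samples):
    """Return {stage: (min_ms, max_ms, avg_ms)}."""
    return {name: (min(v), max(v), sum(v) / len(v)) for name, v in samples.items()}

# Hypothetical placeholder stages; replace with the real pipeline steps.
def fake_resize(frame):
    return [p // 2 for p in frame]

def fake_detect(frame):
    return frame

def fake_draw(frame):
    return frame

stages = [("resize", fake_resize), ("detect", fake_detect), ("draw", fake_draw)]
frames = [list(range(1000)) for _ in range(5)]
stats = summarize(time_stages(stages, frames))
for name, (lo, hi, avg) in stats.items():
    print(f"{name}: min={lo:.3f} ms  max={hi:.3f} ms  avg={avg:.3f} ms")
```

Comparing these per-stage numbers between the two machines would show which CPU stage, if any, accounts for the difference.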

AlexeyAB commented 4 years ago

  1. Try to compile Darknet as an SO/DLL library (https://github.com/AlexeyAB/darknet#how-to-compile-on-linux-using-make) and run:
     LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ./uselib data/coco.names cfg/yolov4.cfg yolov4.weights test.mp4
     It is implemented for the lowest latency.

  2. Also compile OpenCV with GStreamer; it will reduce latency further.

The first 2 points have the largest effect on the delay.

  3. Video capturing, resizing, NMS, bbox drawing, and video saving are implemented on the CPU, so they depend on the CPU. If you want to reduce latency even more, you should implement them on the GPU yourself.

kmsravindra commented 4 years ago

Thanks for your response @AlexeyAB. I am already using OpenCV compiled with GStreamer. I will have to try the SO/DLL option. Regarding implementing video capture, resizing, NMS, etc. on the GPU, I think the DeepStream SDK provides plugins that do exactly that, but I want to check whether there is an alternative using OpenCV or something else. Are you aware of any references or pointers on how those steps could be implemented with OpenCV directly on CUDA?

kmsravindra commented 4 years ago

Hi @AlexeyAB ,

With reference to your note, could you please explain how running the .so file will result in lower latency compared to invoking the ./darknet demo command? Assuming inference time is the same (around 20ms to 25ms per image), will this shared object run videocapture, image preprocessing / postprocessing more efficiently? The overall latency currently observed is around 80ms to 120ms (from capture to display).

  1. LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ./uselib data/coco.names cfg/yolov4.cfg yolov4.weights test.mp4 It is implemented for the lowest latency.

AlexeyAB commented 4 years ago

What min/max/avg latency do you get:

  1. for ./darknet detector demo ... ? - all 3 threads are synced for each frame, so latency = max_lat*3

  2. for ./uselib data/coco.names ... ? - all 3 threads work async, so latency = lat1+lat2+lat3 <= max_lat*3

  3. Try setting the value there to false, recompile, and measure the latency: https://github.com/AlexeyAB/darknet/blob/master/src/yolo_console_dll.cpp#L297
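The two latency formulas above can be compared with a short Python sketch. The per-stage latencies are made-up numbers for illustration, not measurements, and the labels capture/detect/draw are an assumption about what the three threads do.

```python
def sync_latency(stage_ms):
    """Threads synced per frame: every stage takes as long as the slowest one."""
    return len(stage_ms) * max(stage_ms)

def async_latency(stage_ms):
    """Threads run async: a frame only pays each stage's own latency."""
    return sum(stage_ms)

# Hypothetical per-stage latencies in ms (e.g. capture, detect, draw).
stage_ms = [8.0, 22.0, 5.0]

synced = sync_latency(stage_ms)     # max_lat * 3
unsynced = async_latency(stage_ms)  # lat1 + lat2 + lat3
print(f"synced: {synced:.1f} ms, async: {unsynced:.1f} ms")
```

The gap between the two grows as the stages become more unequal; when all three stages take the same time, both formulas give the same value.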