Open ryumansang opened 5 years ago
@ryumansang Hi,
NVIDIA driver 410.48, CUDA 10.0 & cuDNN 7.3.0
Yes, it should be enough.
So you can compile with GPU=1 CUDNN=1 CUDNN_HALF=0 OPENCV=1
or with GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 in the Makefile.
Thank you.
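For illustration, the Makefile flag changes above can be scripted. This is a hypothetical helper, not part of the darknet repo; it assumes the Makefile declares flags as top-level `NAME=0` lines, as darknet's does:

```python
import re

def set_make_flags(makefile_text, flags):
    """Flip top-level NAME=value build flags in a darknet-style Makefile.

    flags: dict such as {"GPU": 1, "CUDNN": 1, "CUDNN_HALF": 1, "OPENCV": 1}.
    Lines that don't match a flag are left untouched.
    """
    for name, value in flags.items():
        makefile_text = re.sub(
            rf"(?m)^{re.escape(name)}=\d+",  # flag at the start of a line
            f"{name}={value}",
            makefile_text,
        )
    return makefile_text

# Example: enable Tensor Cores (CUDNN_HALF=1) before running `make`.
sample = "GPU=0\nCUDNN=0\nCUDNN_HALF=0\nOPENCV=0\n"
print(set_make_flags(sample, {"GPU": 1, "CUDNN": 1, "CUDNN_HALF": 1, "OPENCV": 1}))
```

After rewriting the Makefile you would run `make` as usual; the flags only take effect at compile time.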
I have one other question.
I currently have four configurations: GTX 1060, GTX 1080 Ti, Tesla V100, and RTX 2080 Ti. When tested using data/dog.jpg and yolov3:

- GTX 1060: 0.036 sec (CUDA 8.0)
- GTX 1080 Ti: 0.033 sec (CUDA 8.0)
- Tesla V100: 0.034 sec (tensor: 0.014 sec), CUDA 9.1 & NVIDIA 396.51 driver
- RTX 2080 Ti: 0.019 sec (tensor: 0.014 sec), CUDA 10 & the latest NVIDIA driver
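Converted to throughput, the per-image times above work out roughly as follows (a quick sketch using only the numbers reported in this thread):

```python
# Per-image detection times (seconds) as reported above.
times = {
    "GTX 1060 (CUDA 8.0)": 0.036,
    "GTX 1080 Ti (CUDA 8.0)": 0.033,
    "Tesla V100 (CUDA 9.1)": 0.034,
    "Tesla V100 tensor": 0.014,
    "RTX 2080 Ti (CUDA 10)": 0.019,
    "RTX 2080 Ti tensor": 0.014,
}

baseline = times["GTX 1060 (CUDA 8.0)"]
for gpu, t in times.items():
    fps = 1.0 / t                # frames per second
    speedup = baseline / t       # relative to the GTX 1060
    print(f"{gpu}: {fps:.1f} FPS, {speedup:.2f}x vs GTX 1060")
```

Note the anomaly the poster is asking about: without Tensor Cores, the V100's 0.034 sec is barely faster than the GTX 1080 Ti's 0.033 sec.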
The detection speed is shown above.
I suspect that the Tesla V100 is slower than it should be.
Are the specifications normal?
> Are the specifications normal?
Yes. I got 0.031 sec with CUDNN_HALF=0, and 0.011 sec with CUDNN_HALF=1, on a Tesla V100: https://github.com/AlexeyAB/darknet/issues/407
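Converted to frame rates, those two V100 numbers give roughly a 2.8x speedup from the half-precision path (a quick back-of-the-envelope sketch):

```python
def fps(seconds_per_frame):
    """Per-frame latency (seconds) to throughput (frames per second)."""
    return 1.0 / seconds_per_frame

# Tesla V100 times quoted above: CUDNN_HALF=0 vs CUDNN_HALF=1.
half0, half1 = 0.031, 0.011
print(f"CUDNN_HALF=0: {fps(half0):.1f} FPS")
print(f"CUDNN_HALF=1: {fps(half1):.1f} FPS")
print(f"speedup: {half0 / half1:.2f}x")
```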
Also you can try commenting out this line: https://github.com/AlexeyAB/darknet/blob/31df5e27356b6b11ffd43baace9afdd3800a8aa2/src/convolutional_layer.c#L164
and testing with CUDNN_HALF=0 on the RTX 2080 Ti - that will forcibly disable the Tensor Cores entirely.
Because with this line present, the Tensor Cores can still be used internally by cuDNN even with CUDNN_HALF=0 for 32-bit floats, via automatic FP32->FP16->FP32 conversion. So the 0.019 sec may be on Tensor Cores too.
Also you can try to use INT8 on Tensor Cores, with only a 1-2% mAP decrease for detection, by using this repo: https://github.com/AlexeyAB/yolo2_light with the -quantized flag:
./darknet detector test coco.names yolov3.cfg yolov3.weights -thresh 0.24 dog.jpg -quantized
./darknet detector demo coco.names yolov3.cfg yolov3.weights -thresh 0.24 test.mp4 -quantized
Thank you! Let me test it
What is your frame rate on a video with the 2080ti and CUDNN_HALF=1 ?
thanks !
I tried compiling with CUDNN_HALF=1 for a 2080 Ti setup but am getting a very slim speedup (I'd say 15%), far from the 2.5x claimed, so I'm obviously doing something wrong. Is compiling with OpenCV required for this? (I don't see why it would be, but asking anyway.)
Testing with the Python implementation. Perhaps I need to explicitly cast to fp16 before calling python's detect method?
@nirbenz
> Testing with the Python implementation. Perhaps I need to explicitly cast to fp16 before calling python's detect method?
No.
I simply call darknet.py's detect method on an image read via Darknet's load_image method. I'm only timing the actual call to get_network_boxes so as to minimize possible Python overhead. Still, the speed increase is a lot smaller.
EDIT: Unless I'm missing something; no changes are required to cfg file, weights, etc. - only compilation with the HALF flag, right? I carefully read through relevant issues and found no such requirements but making sure.
@nirbenz
> I simply call darknet.py's detect method on an image read via Darknet's load_image method. I'm only timing the actual function call to get_network_boxes so as to minimize possible Python overhead. Still the speed increase is a lot smaller.
get_network_boxes() isn't the neural-network inference function.
This is the inference function: predict_image(net, im): https://github.com/AlexeyAB/darknet/blob/21a4ec9390b61c0baa7ef72e72e59fa143daba4c/darknet.py#L238
Actually it calls this C function, which can have some overhead: https://github.com/AlexeyAB/darknet/blob/21a4ec9390b61c0baa7ef72e72e59fa143daba4c/src/network.c#L652-L660
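To time just the inference call fairly, the usual approach is a few warm-up iterations followed by the median of many timed runs. A sketch of such a harness is below; the `fake_predict` stand-in is hypothetical, used only so the example runs on its own - with darknet.py you would pass the real `predict_image(net, im)` call instead:

```python
import statistics
import time

def time_call(fn, warmup=5, runs=30):
    """Median wall-clock time of fn(), measured after warm-up iterations.

    Warm-up matters for GPU code: the first CUDA/cuDNN calls include
    context creation and algorithm selection that shouldn't be counted.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# Hypothetical stand-in for darknet's predict_image(net, im); in practice:
#   median = time_call(lambda: predict_image(net, im))
def fake_predict():
    time.sleep(0.001)

print(f"median: {time_call(fake_predict, warmup=2, runs=5):.4f} s")
```

The median is used rather than the mean so that occasional scheduler hiccups or GC pauses don't skew the result.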
darknet.py has overhead, and the current Python code may be a bottleneck on a high-performance GPU.
You shouldn't change anything in the source code or in the cfg-file to use Tensor Cores on a GeForce RTX 2080 Ti - just set GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=1 in the Makefile.
Is OPENCV=1 required?
> get_network_boxes() isn't the neural-network inference function. This is the inference function: predict_image(net, im)
This is what I meant, actually. I wrap the timing measurement around this method (as well as the subsequent ones). The image is already resized to the correct size, to avoid any slowdown from resizing and letterboxing. The goal is to measure only the actual inference call (and that's how I do it).
@nirbenz
OPENCV=1 is required only for training, so that data augmentation will not be a bottleneck for a GPU with Tensor Cores.
In your case OPENCV=1 isn't required; OpenCV would only be used by darknet.py if you un-comment these lines: https://github.com/AlexeyAB/darknet/blob/21a4ec9390b61c0baa7ef72e72e59fa143daba4c/darknet.py#L227-L229
What FPS can you get by using ./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights test.mp4 with CUDNN_HALF=1 and with CUDNN_HALF=0?
@AlexeyAB I will check and get back to you. Thanks a lot.
@AlexeyAB How long does it take to train YOLOv4 on the COCO dataset with an RTX 2080 Ti and a V100, respectively?
@AlexeyAB Thank you very much for your repositories. Could you answer two short questions for me:
Can I train on a GPU with Tensor Cores and run the trained weights on a GPU without Tensor Core support? (I only want to reduce training time.)
I currently use my old workstation with a Quadro K5000 GPU, which is very weak for training YOLO. I am going to buy a GTX 1080 Ti or an RTX 2080 (not Ti). Should I buy the GTX 1080 Ti (older) or the RTX 2080?
Thanks again; sorry if the question is unclear or a duplicate!
Nice to meet you.
I bought an RTX 2080 Ti today. I installed all the required items in the same environment as the V100.
This setup has been tested: NVIDIA driver 410.48, CUDA 10.0 & cuDNN 7.3.0.
Is this the recommended setting?