AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

In training, is it normal that the CPU usage rate is 100% for one core and the GPU usage rate is 30% or less? #5287

mokoenator closed this issue 3 years ago

mokoenator commented 4 years ago

Hi @AlexeyAB .

CPU and GPU are not fully utilized.

I am training in the following environment.

yolov3_5l.cfg

[net]
batch=64
subdivisions=20
width=608
height=608
・・・・

CPU: AMD Ryzen Threadripper 2950X 16-Core Processor
GPU: RTX TITAN


In the version from around June of last year, many CPU cores were busy, and 100 iterations took over 20 minutes (11,000 training images).

The current version is slower: 100 iterations take over 40 minutes (12,000 training images).

I modified detector.c at lines 152 and 357.

Line 152:

#ifdef OPENCV
    //args.threads = 6 * ngpus;   // 3 for - Amazon EC2 Tesla V100: p3.2xlarge (8 logical cores) - p3.16xlarge
    args.threads = 20 * ngpus;    // Ryzen 7 2700X (16 logical cores)

Line 357:

        if (iteration % 100 == 0) {
        //if (iteration >= (iter_save + 1000) || iteration % 1000 == 0) {

The message when running darknet.exe is as follows

  CUDA-version: 10010 (10010), cuDNN: 7.6.0, CUDNN_HALF = 1, GPU count: 2
  CUDNN_HALF = 1
  OpenCV version: 3.4.0
yolov3_5l
  compute_capability = 750, cudnn_half = 1
net.optimized_memory = 0
mini_batch = 3, batch = 60, time_steps = 1, train = 1
    layer filters size / strd (dil) input output
    0 conv 32 3 x 3/1 608 x 608 x 3-> 608 x 608 x 32 0.639 BF
...
...

 (next mAP calculation at 7046 iterations)
 Last accuracy mAP@0.5 = 75.92 %, best = 75.92 %
 Tensor Cores are used.
 5749: 0.706832, 0.603698 avg loss, 0.001000 rate, 22.226000 seconds, 344940 images, 229.779482 hours left
Loaded: 0.000000 seconds
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 82 Avg (IOU: 0.829406, GIOU: 0.826049), Class: 0.691292, Obj: 0.737168, No Obj: 0.004334, .5R: 1.000000, .75R: 0.750000, count: 8, class_loss = 1.640729, iou_loss = 0.311962, total_loss = 1.952691
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 94 Avg (IOU: 0.834724, GIOU: 0.834075), Class: 0.968123, Obj: 0.473304, No Obj: 0.000430, .5R: 1.000000, .75R: 0.666667, count: 3, class_loss = 0.581441, iou_loss = 0.163724, total_loss = 0.745164
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.

Is this normal?
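
As a rough cross-check of the numbers above, the per-iteration time printed in the log can be converted into time per 100 iterations. This is only arithmetic on a single sampled log line (22.226 seconds at iteration 5749), and the helper name is hypothetical, not something from Darknet.

```c
#include <assert.h>

/* Hypothetical helper (not a Darknet function): minutes needed to run
   `iterations` iterations when each one takes `seconds_per_iteration`. */
double minutes_for_iterations(double seconds_per_iteration, int iterations)
{
    assert(seconds_per_iteration >= 0.0 && iterations >= 0);
    return seconds_per_iteration * iterations / 60.0;
}
```

Calling minutes_for_iterations(22.226, 100) gives about 37 minutes, which is in the same range as the 40-minute figure reported for the current version.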

AlexeyAB commented 4 years ago

batch=64 subdivisions=20

batch/subdivisions should be an integer value.
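
A minimal sketch of why this matters, assuming (as the log line `mini_batch = 3, batch = 60` suggests) that the mini-batch size is obtained by integer division of batch by subdivisions. The function names below are hypothetical and are not part of Darknet.

```c
#include <assert.h>

/* Hypothetical helper: images per forward/backward pass, assuming the
   mini-batch is batch / subdivisions with C integer (truncating) division. */
int mini_batch_size(int batch, int subdivisions)
{
    assert(batch > 0 && subdivisions > 0);
    return batch / subdivisions;
}

/* Images actually consumed per iteration once the division has truncated. */
int effective_batch(int batch, int subdivisions)
{
    return mini_batch_size(batch, subdivisions) * subdivisions;
}

/* Returns 1 when batch is an exact multiple of subdivisions, else 0. */
int divides_evenly(int batch, int subdivisions)
{
    return subdivisions > 0 && batch % subdivisions == 0;
}
```

For the values in this thread, mini_batch_size(64, 20) is 3 and effective_batch(64, 20) is 60, matching the log, so 4 of the 64 images are dropped from every iteration. The log's image counter agrees: 344940 images after 5749 iterations is 60 images per iteration. With subdivisions = 32, divides_evenly(64, 32) holds and all 64 images are used.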



  1. Try to set 30 instead of 5 and recompile https://github.com/AlexeyAB/darknet/blob/88f28f7fcc8fff88fff6dc90a7b4b5474e9a52ff/src/data.c#L1430

  2. If it doesn't help, then try to set args.threads = 12 * ngpus; and recompile https://github.com/AlexeyAB/darknet/blob/88f28f7fcc8fff88fff6dc90a7b4b5474e9a52ff/src/detector.c#L152-L153

Does it help?

mokoenator commented 4 years ago

Thank you for the reply!

Try to set 30 instead of 5 and recompile

Originally it was set to 5.

If it doesn't help, then try to set args.threads = 12 * ngpus; and recompile

Recompiled

What command do you use for training?

darknet.exe detector train D:\git_work\yolo-set\yolo_runner\my.data D:\git_work\yolo-set\yolo_runner\yolov3_5l.cfg #log\yolov3_5l_5100.weights -gpus 0 -map

Do you get this issue with yolov3.cfg instead of yolov3_5l.cfg?

I tried running it with an almost default yolov3.cfg. My custom model has 15 classes.

[net]
# Testing

# Training
batch = 64
subdivisions = 32
width = 416
height = 416
channels = 3
momentum = 0.9
decay = 0.0005
angle = 0
saturation = 1.5
exposure = 1.5
hue = .1

...

filters = 60
activation = linear

[yolo]
mask = 6,7,8
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes = 15


GPU load is now 60.

About 7 minutes.


This may not be the right place to ask, but we are building the service linked below. Is yolov3.cfg better than larger models like yolov3_5l.cfg? I also want to select small objects such as shoes and sandals. (About the size of the image below)

https://funnel-service.com/predictor.html?command=idsearch&id=592d09c39f07d9949e682b808373360ea93c0b01

image

AlexeyAB commented 4 years ago

Originally it was set to 5.

So did you try both 30 and 5? With which value is training faster?


GPU LOAD is now 60 About 7 minutes

Now try yolov3_5l.cfg with batch = 64, subdivisions = 32, width=608, height=608 and args.threads = 12 * ngpus; What CPU/GPU load and training time do you get?


I also want to select small objects such as shoes and sandals. (About the size of the image below)

Calculate anchors for -width 608 -height 608. Don't use them in the cfg, just show them to me, and I will tell you what sizes of objects are in your dataset:

./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608


It may be better to use yolov3-spp.cfg with batch = 64, subdivisions = 16 or 32, width=608, height=608.

mokoenator commented 4 years ago

Retrying with the same yolov3.cfg as last time:

[net]
# Testing

# Training
batch = 64
subdivisions = 32
width = 416
height = 416
channels = 3
momentum = 0.9
decay = 0.0005
angle = 0
saturation = 1.5
exposure = 1.5
hue = .1

...

filters = 60
activation = linear

[yolo]
mask = 6,7,8
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes = 15

args.threads = 12 * ngpus;    // Ryzen 7 2700X (16 logical cores)
and
static const int thread_wait_ms = 30;


About 7 minutes


args.threads = 12 * ngpus;    // Ryzen 7 2700X (16 logical cores)
and
static const int thread_wait_ms = 5;


About 7 minutes


I also tried running it with the source I cloned a long time ago. Branch message ...

Revision: 3aca0b71666bac0dd5760833aea036e7bd897c8a
Author: AlexeyAB <alexeyab84@gmail.com>
Date: 2019/05/25 0:48:11
Message:
conv-LSTM training speedup

----
Modified: src/conv_lstm_layer.c
Modified: src/image_opencv.cpp

The modified code is in detector.c (both places edited at the same time):

#ifdef OPENCV
    //args.threads = 3 * ngpus;   // Amazon EC2 Tesla V100: p3.2xlarge (8 logical cores) - p3.16xlarge
    args.threads = 28 * ngpus;

...

        if (i % 100 == 0) {
        //if (i >= (iter_save + 1000) || i % 1000 == 0) {
            iter_save = i;


About 5 minutes

Is any of the above helpful?


I will try calculating the anchors and running yolov3-spp.cfg. Thank you.

AlexeyAB commented 4 years ago

So use default args.threads = 28 * ngpus; and static const int thread_wait_ms = 5;

The 2 additional yolo layers (in yolov3_5l.cfg) are simply too slow.


Yes, train yolov3_spp.cfg


If you will use Darknet for Detection rather than other frameworks (TRT, OpenCV-dnn, TF, ...) then you can try to train

mokoenator commented 4 years ago

With yolov3_spp.cfg, the GPU usage rate is 80%! Thank you!

I hadn't noticed there was a v4. I will try it.