AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Training time is high #5147

sctrueew opened this issue 4 years ago (status: Open)

sctrueew commented 4 years ago

Hi @AlexeyAB,

I'm using the csresnext50-panet-spp-original-optimal model and I have two RTX 2080 Ti GPUs, but the estimated training time is about 340. Is that OK?

GPU: 2x RTX 2080 Ti, Storage: SSD, CPU: Core i9

[net]
batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
flip=0
learning_rate=0.00261
burn_in=1000
max_batches = 100000
policy=steps
steps=80000,90000
scales=.1,.1

...

mask = 6,7,8
anchors = 35, 9, 34, 24, 72, 21, 62, 65, 117, 41, 138,101, 221, 68, 236,155, 510,139
classes=31
num=9

darknet.exe detector train obj.obj nest_panet_opt.cfg orginal_opt.conv.112 -dont_show -map -gpus 0,1

[screenshot]

[screenshot]

AlexeyAB commented 4 years ago
sctrueew commented 4 years ago

@AlexeyAB Hi,

No, I use the default settings of csresnext50-panet-spp-original-optimal. I get an error with this:

batch=64 subdivisions=16

 Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: ..\..\src\dark_cuda.c : cuda_make_array() : line: 373 : build time: Mar 30 2020 - 12:17:49
CUDA Error: out of memory

But with subdivisions=32 it works. Why doesn't csresnext50-panet-spp-original-optimal have random=1 by default? If it did, would the accuracy be greater than 64.4 on the COCO dataset?

Can I start the training with 2 GPUs right away, or do I need to start with 1 GPU for at least the first 1K iterations and then continue the training with 2 GPUs?

Thanks
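(For context on the out-of-memory error above: darknet loads batch / subdivisions images onto the GPU per forward pass, so a larger subdivisions value lowers peak GPU memory at the cost of more sequential mini-batches per iteration. With batch=64, subdivisions=16 means 4 images per mini-batch, subdivisions=32 means 2, and subdivisions=64 means 1, which is consistent with subdivisions=32 fitting on an 11 GB RTX 2080 Ti at 608x608 here while subdivisions=16 does not.)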

sctrueew commented 4 years ago

@AlexeyAB Hi,

I have added loss_scale=128, but the time has increased to 330.

AlexeyAB commented 4 years ago

Why doesn't csresnext50-panet-spp-original-optimal have random=1 by default? If it did, would the accuracy be greater than 64.4 on the COCO dataset?

It is random=1 by default: https://github.com/AlexeyAB/darknet/blob/0e063371500bc998584aa58313cee04b5cf354c4/cfg/csresnext50-panet-spp-original-optimal.cfg#L1036


I have added loss_scale=128, but the time has increased to 330.

You should train at least 1000 iterations beyond burn_in=1000 in your cfg-file to get a correct time estimate.


Can I start the training with 2 GPUs right away, or do I need to start with 1 GPU for at least the first 1K iterations and then continue the training with 2 GPUs?

Better to train the first 1000 iterations by using 1 GPU.
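
Concretely, following the multi-GPU recipe in this repository's README, the launch from above can be split into two stages (the backup path and the _1000.weights file name below are assumptions based on darknet's default checkpoint naming):

darknet.exe detector train obj.obj nest_panet_opt.cfg orginal_opt.conv.112 -dont_show -map -gpus 0

then stop after roughly 1000 iterations and continue with both GPUs from the partially trained weights:

darknet.exe detector train obj.obj nest_panet_opt.cfg backup\nest_panet_opt_1000.weights -dont_show -map -gpus 0,1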

sctrueew commented 4 years ago

@AlexeyAB Hi,

Unfortunately, I sometimes get this error when I use multiple GPUs:

CUDA Error Prev: unspecified launch failure
CUDA status Error: file: ....\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 30 2020 - 12:17:49
CUDA Error: unspecified launch failure

I then continue the training from the last saved weights. I think the problem is the driver, because I have another PC with the same config where I haven't updated the driver yet, and it works fine.

AlexeyAB commented 4 years ago

CUDA Error Prev: unspecified launch failure
CUDA status Error: file: ....\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 30 2020 - 12:17:49
CUDA Error: unspecified launch failure

This is a new, strange error that often occurs with multi-GPU training; it appears only on some servers, while on other servers there is no such problem.

I think the problem is the driver, because I have another PC with the same config where I haven't updated the driver yet, and it works fine.

sctrueew commented 4 years ago

@AlexeyAB Hi,

Server 1: GPU: 2x RTX 2080 Ti, OS: Win 10, CUDA: 10.2, cuDNN: 7.6.5, cudnn_half=1, driver version: 26.21.14.4575

Server 2: GPU: 2x RTX 2080 Ti, OS: Win 10, CUDA: 10, cuDNN: 7.4, cudnn_half=1, driver version: 26.21.14.3200

sctrueew commented 4 years ago

@AlexeyAB Hi,

I still have a problem with training time. It may be due to the size of the dataset and the resolution of the images, because the dataset contains several different resolutions:

  1. 1719x1719
  2. 1920x1080
  3. 2704x1520
  4. 3840x2160

Is it better to resize all of the images before training? If yes, what size should I choose? Is the default network size good for this case?

Thanks.

AlexeyAB commented 4 years ago

If this time (highlighted in red in the screenshot below) is near zero, then the reason isn't image size / HDD speed / a CPU bottleneck.

[screenshot]

So you shouldn't resize images.

sctrueew commented 4 years ago

@AlexeyAB Hi, so what's the reason? I have good hardware, but for 31 classes it takes about 10 days.

AlexeyAB commented 4 years ago
  1. Show a screenshot of GPU utilization from the GPU-Z or nvidia-smi utility.

  2. Show such a screenshot for iteration 2000 or higher:

[example screenshot]

sctrueew commented 4 years ago
  1. Show a screenshot of GPU utilization from the GPU-Z or nvidia-smi utility. [screenshot]

  2. Show such a screenshot for iteration 2000 or higher: [screenshot]

AlexeyAB commented 4 years ago

So what's the reason? I have good hardware, but for 31 classes it takes about 10 days.

It doesn't depend on the number of classes.


Do you train on the MS COCO dataset?

Also show these parameters from your cfg-file:

[net]
batch=64
subdivisions=32
width=608
height=608

loss_scale=128

[yolo]
random=1

Try to set

[net]
batch=64
subdivisions=16
width=576
height=576

loss_scale=128

[yolo]
random=1

Or

[net]
batch=64
subdivisions=16
width=544
height=544

loss_scale=128

[yolo]
random=1
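
(For a rough sense of the memory impact of these suggestions: activation memory scales approximately with width x height, so 576x576 needs about 90% and 544x544 about 80% of the per-image memory of 608x608, which is why the lower resolutions may make subdivisions=16 feasible where 608x608 was not.)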
sctrueew commented 4 years ago

@AlexeyAB Hi,

Do you train on the MS COCO dataset?

No, I don't

I will check it with 576 or 544, but I got a good result with 608. How much will the accuracy decrease with 576 or 544?

sctrueew commented 4 years ago

@AlexeyAB Hi,

I have 31 classes and about 14,800 images, but the dataset is not balanced, so I added repeated files to train.txt for balancing; now I have 104,338 entries in train.txt. Is that right?

AlexeyAB commented 4 years ago

I have 31 classes and about 14,800 images, but the dataset is not balanced, so I added repeated files to train.txt for balancing; now I have 104,338 entries in train.txt. Is that right?

Yes, you can do this.
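
(As a rough illustration of that balancing approach: if one class appears in only about 500 images while another appears in about 5,000, listing each of the rarer class's images roughly 10 times in train.txt makes the two classes about equally likely to be drawn into a batch. Going from 14,800 unique images to 104,338 lines is the same idea applied across all 31 classes; the per-class counts here are illustrative, not taken from the thread.)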

I will check it with 576 or 544, but I got a good result with 608. How much will the accuracy decrease with 576 or 544?

It depends on your dataset. The default csresnext50-panet-spp-original-optimal was trained with width=512 height=512.
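
(Note that darknet requires the network width and height to be divisible by 32, so the practical choices near the original resolution are 512, 544, 576, and 608.)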

sctrueew commented 4 years ago

@AlexeyAB Hi,

I get this error when I set 576 or 544, and even 512, with batch=64 and subdivisions=16:

Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: ....\src\dark_cuda.c : cuda_make_array() : line: 373 : build time: Mar 30 2020 - 12:17:49
CUDA Error: out of memory
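
(One possible contributor, as an assumption based on how random=1 behaves in this fork: with random=1 the network is periodically resized above the configured width and height during training, so peak memory is higher than a fixed 512x512 run would suggest. If subdivisions=16 keeps running out of memory, the usual workarounds are a larger subdivisions value or random=0, at some cost to accuracy.)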

colinlin1982 commented 4 years ago
  1. Show a screenshot of GPU utilization from the GPU-Z or nvidia-smi utility. [screenshot]

  2. Show such a screenshot for iteration 2000 or higher: [screenshot]

Excuse me, but shouldn't the "9.280000 seconds" indicate the training time for each batch? 9.28 × 100,000 / (3600 × 24) ≈ 10.74 days. I think that's normal.
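
(More generally, the estimated total training time is roughly seconds_per_iteration × max_batches: at 9.28 s per iteration with max_batches=100000, that is about 928,000 s, or roughly 10.7 days, on this setup. Cutting the per-iteration time, for example with a lower network resolution as suggested above, shortens it proportionally, and as noted earlier the per-iteration time should be read well past the burn_in=1000 phase to be representative.)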