sctrueew opened this issue 4 years ago
Do you use random=1 in the last [yolo] layer? random=1 increases mAP, but decreases training speed.
Try to train with the following; it will enable Tensor Cores for training:
[net]
loss_scale=128
Also try to set
batch=64
subdivisions=16
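Putting these suggestions together, the relevant parts of the cfg-file would look roughly like this (a sketch only: loss_scale, batch and subdivisions belong in the [net] section, random=1 in the last [yolo] section, and subdivisions may need to be raised if you run out of GPU memory):
[net]
batch=64
subdivisions=16
loss_scale=128
...
[yolo]
random=1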
@AlexeyAB Hi,
No, I use the default settings of csresnext50-panet-spp-original-optimal. I get an error with
batch=64 subdivisions=16
Try to set subdivisions=64 in your cfg-file. CUDA status Error: file: ..\..\src\dark_cuda.c : cuda_make_array() : line: 373 : build time: Mar 30 2020 - 12:17:49 CUDA Error: out of memory
But with subdivisions=32 it works. Why doesn't csresnext50-panet-spp-original-optimal use random=1 by default? If it did, would the accuracy be greater than 64.4 on the COCO dataset?
Can I start the training with 2 GPUs directly, or do I need to train with 1 GPU for at least the first 1000 iterations and then continue the training with 2 GPUs?
Thanks
@AlexeyAB Hi,
I have added loss_scale=128, but the estimated time increased to 330.
Why doesn't csresnext50-panet-spp-original-optimal use random=1 by default? If it did, would the accuracy be greater than 64.4 on the COCO dataset?
It is random=1 by default: https://github.com/AlexeyAB/darknet/blob/0e063371500bc998584aa58313cee04b5cf354c4/cfg/csresnext50-panet-spp-original-optimal.cfg#L1036
I have added loss_scale=128, but the estimated time increased to 330.
You should train at least 1000 iterations beyond burn_in=1000 in your cfg-file to get a correct time estimate.
Can I start the training with 2 GPUs directly, or do I need to train with 1 GPU for at least the first 1000 iterations and then continue the training with 2 GPUs?
It is better to train the first 1000 iterations using 1 GPU.
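For example, a possible command sequence (a sketch only, reusing the file names from the command shown later in this thread; the exact name of the partially-trained weights file depends on your cfg name and on the backup folder set in your .data file):
darknet.exe detector train obj.obj nest_panet_opt.cfg orginal_opt.conv.112 -dont_show -map
Then, after about 1000 iterations, stop and continue from the partially-trained weights with both GPUs:
darknet.exe detector train obj.obj nest_panet_opt.cfg backup\nest_panet_opt_1000.weights -dont_show -map -gpus 0,1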
@AlexeyAB Hi,
Unfortunately, I sometimes get this error when I use multiple GPUs:
CUDA Error Prev: unspecified launch failure CUDA status Error: file: ..\..\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 30 2020 - 12:17:49 CUDA Error: unspecified launch failure
I continue the training again from the last weights. I think the problem is the driver, because I have another PC with the same configuration where I haven't updated the driver yet, and it works fine.
CUDA Error Prev: unspecified launch failure CUDA status Error: file: ..\..\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 30 2020 - 12:17:49 CUDA Error: unspecified launch failure
This is a new, strange error that often occurs during multi-GPU training; it appears only on some servers, while other servers have no such problem.
I think the problem is the driver, because I have another PC with the same configuration where I haven't updated the driver yet, and it works fine.
@AlexeyAB Hi,
Server 1: GPU: 2x RTX 2080 Ti, OS: Win 10, CUDA: 10.2, cuDNN: 7.6.5, cudnn_half=1, Driver version: 26.21.14.4575
Server 2: GPU: 2x RTX 2080 Ti, OS: Win 10, CUDA: 10, cuDNN: 7.4, cudnn_half=1, Driver version: 26.21.14.3200
@AlexeyAB Hi,
I still have a problem with the training time. It may be due to the size of the dataset and the resolution of the images, because the images in my dataset have different resolutions.
Is it better to resize all of the images before training? If so, what size should I choose? Is the default network size good for this case?
Thanks.
If this time (highlighted in red) is near zero, then the reason isn't image size / HDD speed / a CPU bottleneck, so you shouldn't resize the images.
@AlexeyAB Hi, So what's the reason? I have good hardware, but for 31 classes it takes about 10 days.
Show a screenshot of GPU utilization from the GPU-Z or nvidia-smi utility.
Show such a screenshot for iteration 2000 or higher:
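If a screenshot is inconvenient, the same information can be read from the command line; a minimal nvidia-smi sketch (the 1-second refresh interval is only an example):
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 1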
Show a screenshot of GPU utilization from the GPU-Z or nvidia-smi utility.
Show such a screenshot for iteration 2000 or higher:
So what's the reason? I have good hardware, but for 31 classes it takes about 10 days.
It doesn't depend on the number of classes.
Do you train on MS COCO dataset?
Also show the following parameters from your cfg-file:
[net]
batch=64
subdivisions=32
width=608
height=608
loss_scale=128
[yolo]
random=1
Try to set
[net]
batch=64
subdivisions=16
width=576
height=576
loss_scale=128
[yolo]
random=1
Or
[net]
batch=64
subdivisions=16
width=544
height=544
loss_scale=128
[yolo]
random=1
@AlexeyAB Hi,
Do you train on MS COCO dataset?
No, I don't
I will check it with 576 or 544, but I got a good result with 608. How much will the accuracy decrease with 576 or 544?
@AlexeyAB Hi,
I have 31 classes and about 14,800 images. The dataset is not balanced, so I added repeated files to train.txt for balancing, and now I have 104,338 entries in train.txt. Is that right?
I have 31 classes and about 14,800 images. The dataset is not balanced, so I added repeated files to train.txt for balancing, and now I have 104,338 entries in train.txt. Is that right?
Yes, you can do this.
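For illustration, a minimal Python sketch of this kind of oversampling (the file names rare_images.txt and common_images.txt and the repeat factor are hypothetical; in practice you would group your image paths by the classes they contain):
import random

# Hypothetical inputs: image paths grouped by how rare their classes are.
# Build these lists from your own label files.
rare_images = open("rare_images.txt").read().splitlines()
common_images = open("common_images.txt").read().splitlines()

# Repeat the rare images so every class is seen roughly as often per epoch;
# the factor 7 is only an example.
balanced = common_images + rare_images * 7
random.shuffle(balanced)

# Write the balanced list as the new train.txt for darknet.
with open("train.txt", "w") as f:
    f.write("\n".join(balanced) + "\n")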
I will check it with 576 or 544, but I got a good result with 608. How much will the accuracy decrease with 576 or 544?
It depends on your dataset.
The default csresnext50-panet-spp-original-optimal was trained with width=512 height=512.
@AlexeyAB Hi,
I get this error when I set 576 or 544, or even 512, with batch=64 and subdivisions=16:
Try to set subdivisions=64 in your cfg-file. CUDA status Error: file: ..\..\src\dark_cuda.c : cuda_make_array() : line: 373 : build time: Mar 30 2020 - 12:17:49 CUDA Error: out of memory
Show a screenshot of GPU utilization from the GPU-Z or nvidia-smi utility.
Show such a screenshot for iteration 2000 or higher:
Excuse me, but shouldn't the "9.280000 seconds" indicate the training time for each batch? 9.28 × 100000 / (3600 × 24) ≈ 10.74 days, so I think it's normal.
Hi @AlexeyAB,
I'm using this csresnext50-panet-spp-original-optimal model and I have two RTX 2080 Ti cards, but the training time is about 340. Is that OK?
GPU: 2x RTX 2080 Ti, Storage: SSD, CPU: Core i9
[net]
batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
flip=0
learning_rate=0.00261
burn_in=1000
max_batches = 100000
policy=steps
steps=80000,90000
scales=.1,.1
...
mask = 6,7,8
anchors = 35, 9, 34, 24, 72, 21, 62, 65, 117, 41, 138,101, 221, 68, 236,155, 510,139
classes=31
num=9
darknet.exe detector train obj.obj nest_panet_opt.cfg orginal_opt.conv.112 -dont_show -map -gpus 0,1