yassinesefrioui opened this issue 4 years ago
I am also interested in this. Would I need to set CUDNN_HALF=1 when using a Tesla V100?
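For reference, a minimal sketch of the relevant Makefile flags (assuming the AlexeyAB fork's Makefile; CUDNN_HALF only gives a speedup on Tensor Core GPUs such as the Volta-based V100):

GPU=1
CUDNN=1
# CUDNN_HALF=1 enables the mixed-precision (FP16) code paths;
# it requires a GPU with compute capability >= 7.0 (e.g. Tesla V100)
CUDNN_HALF=1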
Show your chart.png
(chart.png attached)
In backup I have several models. If I use darknet53.conv.74 it works normally, but if I use yolov3_1000.weights or another one, I still have the same problem.
Is yolov3_1000.weights trained for 6 classes?
It does not work even if I use a single GPU.
So do you get NaN on 1 GPU?
@AlexeyAB thank you for your answer. This is my chart.png:
Yes, my yolov3 is trained for 6 classes. Yes, I get NaN even with 1 GPU.
It works normally only with my first model (yolov3.weights), but it does not work with other models like yolov3_5000_AlexeyAB.weights.
"works normally" - what does it mean?
"works normally" it is mean, the avg loss not equal "nan".
Thank you.
So just train by using darknet53.conv.74
or yolov3.weights
Did you check your training dataset by using Yolo_mark? Do you have files bad.list or bad_label.list?
Try to run training with the flag -show_imgs at the end of the training command; it will generate aug_...jpg images. Do you see correct labels (bounding boxes) around the objects?
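For example, appending the flag to the training command used later in this thread (same data/cfg paths):

./darknet detector train bs-digital/voc.data bs-digital/yolov3.cfg darknet53.conv.74 -show_imgs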
By using yolov3.weights (it is my own model, pre-trained like darknet53.conv.74 to find the same objects).
I didn't check my dataset by using Yolo_mark, but the data is correct; I checked it with my own script.
(I use the same data and it works in darknet/pjreddie even after 50k iterations.)
Okay, I will try with the flag -show_imgs. Thank you.
Hello @AlexeyAB!
YOLOv3 trains normally before 6k iterations, but after 6k iterations the avg loss becomes nan.
I use this command: ./darknet detector train bs-digital/voc.data bs-digital/yolov3.cfg backup/yolov3.weights -gpus 0,1,2,3 -dont_show -map
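A single-GPU variant of this command, for the 1-GPU test discussed above, would presumably be:

./darknet detector train bs-digital/voc.data bs-digital/yolov3.cfg backup/yolov3.weights -gpus 0 -dont_show -map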
ARCH= -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_37,code=sm_37 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52]
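Note that this ARCH list stops at compute_52; for the Tesla V100 asked about above (compute capability 7.0), the list would presumably also need:

-gencode arch=compute_70,code=[sm_70,compute_70]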
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=96
subdivisions=32
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 500000
policy=steps
steps=400000,450000
scales=.1,.1
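# Note: with batch=96 and subdivisions=32, each forward/backward pass
# processes batch/subdivisions = 96/32 = 3 images; the weights are
# updated once per full batch of 96 images.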
[convolutional] batch_normalize=1 filters=32 size=3 stride=1 pad=1 activation=leaky
# Downsample
[convolutional] batch_normalize=1 filters=64 size=3 stride=2 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=32 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
# Downsample
[convolutional] batch_normalize=1 filters=128 size=3 stride=2 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
# Downsample
[convolutional] batch_normalize=1 filters=256 size=3 stride=2 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
# Downsample
[convolutional] batch_normalize=1 filters=512 size=3 stride=2 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
# Downsample
[convolutional] batch_normalize=1 filters=1024 size=3 stride=2 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear
######################
[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky
[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky
[convolutional] size=1 stride=1 pad=1 filters=33 activation=linear
[yolo] mask = 6,7,8 anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 classes=6 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 random=1
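# Note: filters=33 in the [convolutional] layer just above follows the
# usual rule filters = (classes + 5) * <masks per [yolo] layer>
# = (6 + 5) * 3 = 33, consistent with classes=6 (the same holds for the
# two [yolo] heads below).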
[route] layers = -4
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[upsample] stride=2
[route] layers = -1, 61
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky
[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky
[convolutional] size=1 stride=1 pad=1 filters=33 activation=linear
[yolo] mask = 3,4,5 anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 classes=6 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 random=1
[route] layers = -4
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[upsample] stride=2
[route] layers = -1, 36
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky
[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky
[convolutional] size=1 stride=1 pad=1 filters=33 activation=linear
[yolo] mask = 0,1,2 anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 classes=6 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 random=1