AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Train yolov4 with contrastive loss but get segmentation fault and N==0 error #6625

Open jylink opened 4 years ago

jylink commented 4 years ago

I can train a normal yolov4, but when I train yolov4 + contrastive loss for embeddings, training fails after some iterations. Sometimes it is a segmentation fault, sometimes an N==0 error. What could be the reason?


Makefile:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=1
OPENMP=1
LIBSO=1
ZED_CAMERA=0
ZED_CAMERA_v2_8=0
USE_CPP=0
DEBUG=0

Command I used:

./darknet detector train cfg/bf-vis.data cfg/yolov4-em-bf-vis.cfg weights/yolov4.conv.137 \
-mjpeg_port 8090 -map -dont_show

Header info:

CUDA-version: 10010 (10010), cuDNN: 7.6.4, CUDNN_HALF=1, GPU count: 4  
 CUDNN_HALF=1 
 OpenCV version: 3.4.2
 Prepare additional network for mAP calculation...
 0 : compute_capability = 700, cudnn_half = 1, GPU: Tesla V100-SXM2-32GB 
net.optimized_memory = 0 
mini_batch = 1, batch = 16, time_steps = 1, train = 0 
   layer   filters  size/strd(dil)      input                output
   0 conv     32       3 x 3/ 1    608 x 608 x   3 ->  608 x 608 x  32 0.639 BF
   1 conv     64       3 x 3/ 2    608 x 608 x  32 ->  304 x 304 x  64 3.407 BF

Error message (segmentation fault):

 Tensor Cores are disabled until the first 3000 iterations are reached.
 (next mAP calculation at 1000 iterations) 
 72: 3321.924805, 4465.612305 avg loss, 0.000000 rate, 9.668162 seconds, 4608 images, 37.265379 hours left
  avg_contrastive_acc = 17.674627 
 MJPEG-stream sent. 
Loaded: 0.000041 seconds
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 149 Avg (IOU: 0.390754, GIOU: 0.197088), Class: 0.481200, Obj: 0.374952, No Obj: 0.334788, .5R: 0.100000, .75R: 0.000000, count: 10, class_loss = 6304.377441, iou_loss = 11.058594, total_loss = 6315.436035 
 Contrast accuracy = 87.000000 %, all = 8, good = 7, same = 8, diff = 8 
 contrastive loss = 5.682550 

...

v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 149 Avg (IOU: 0.437507, GIOU: 0.350589), Class: 0.511675, Obj: 0.328880, No Obj: 0.331702, .5R: 0.285714, .75R: 0.000000, count: 7, class_loss = 6176.578125, iou_loss = 3.053711, total_loss = 6179.631836 
 Contrast accuracy = 33.000000 %, all = 6, good = 2, same = 6, diff = 6 
Segmentation fault (core dumped)

Error message (N==0):

 Tensor Cores are disabled until the first 3000 iterations are reached.
 (next mAP calculation at 1000 iterations) 
 79: 2121.149902, 3380.669434 avg loss, 0.000000 rate, 10.034842 seconds, 5056 images, 37.315313 hours left
  avg_contrastive_acc = 23.644901 
 MJPEG-stream sent. 
Loaded: 0.000042 seconds
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 149 Avg (IOU: 0.464567, GIOU: 0.376872), Class: 0.442608, Obj: 0.271741, No Obj: 0.264202, .5R: 0.363636, .75R: 0.090909, count: 22, class_loss = 3953.858154, iou_loss = 9.941895, total_loss = 3963.800049 
 Contrast accuracy = -1.000000 %, all = 0, good = 0, same = 0, diff = 0 
 contrastive loss = 0.006436 

...

v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 149 Avg (IOU: 0.470824, GIOU: 0.415021), Class: 0.487077, Obj: 0.295903, No Obj: 0.261723, .5R: 0.500000, .75R: 0.000000, count: 10, class_loss = 3877.018066, iou_loss = 3.769531, total_loss = 3880.787598 
 Contrast accuracy = 30.000000 %, all = 10, good = 3, same = 10, diff = 10 
 Error: N == 0 || temperature == 0 || vec_len == 0. N=0.000000, temperature=1.000000, vec_len=203811264940474368.000000, labels[i] = 21975 
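In the second log, a `Contrast accuracy` line reporting `all = 0` appears some iterations before the `N == 0` error. Whether the two are related is not established in this thread. The sketch below only makes such lines easy to find in a saved training log. The function name and the idea of saving the console output to a file are assumptions for illustration, and the regular expression is based solely on the line format shown above.

```python
import re

# Pattern derived from the "Contrast accuracy" lines in the logs above.
CONTRAST_RE = re.compile(
    r"Contrast accuracy = (-?\d+(?:\.\d+)?) %, "
    r"all = (\d+), good = (\d+), same = (\d+), diff = (\d+)"
)


def find_empty_contrast_batches(log_text):
    """Return 1-based line numbers of 'Contrast accuracy' lines with all = 0."""
    hits = []
    for n, line in enumerate(log_text.splitlines(), start=1):
        m = CONTRAST_RE.search(line)
        if m and int(m.group(2)) == 0:
            hits.append(n)
    return hits


if __name__ == "__main__":
    sample = (
        " Contrast accuracy = -1.000000 %, all = 0, good = 0, same = 0, diff = 0 \n"
        " contrastive loss = 0.006436 \n"
        " Contrast accuracy = 30.000000 %, all = 10, good = 3, same = 10, diff = 10 \n"
    )
    print(find_empty_contrast_batches(sample))
```

Comparing the flagged line numbers with the position of the crash across several runs would show whether the empty batches consistently precede the error.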

The darknet repo was cloned on September 5.

The dataset is my own, with 4 classes and about 12,000 images in jpg, jpeg, and png formats.

No bad.list or bad_label.list was found.
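As an independent cross-check of the labels, here is a minimal Python sketch. It assumes the usual darknet label layout: one `.txt` file per image, each line `class x_center y_center width height`, with coordinates normalized to the image size. The function names and the directory argument are hypothetical. Passing this check does not rule out other causes of the crash.

```python
from pathlib import Path


def check_label_text(text, num_classes):
    """Return (line_number, message) problems found in one label file's text."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) < 5:
            problems.append((n, "expected at least 5 fields"))
            continue
        try:
            cls = int(parts[0])
            x, y, w, h = (float(v) for v in parts[1:5])
        except ValueError:
            problems.append((n, "non-numeric field"))
            continue
        if not 0 <= cls < num_classes:
            problems.append((n, f"class id {cls} outside 0..{num_classes - 1}"))
        centers_ok = all(0.0 <= v <= 1.0 for v in (x, y))
        sizes_ok = all(0.0 < v <= 1.0 for v in (w, h))
        if not (centers_ok and sizes_ok):
            problems.append((n, "box values not normalized"))
    return problems


def check_label_dir(label_dir, num_classes):
    """Map each problematic label file in label_dir to its list of problems."""
    report = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        problems = check_label_text(path.read_text(), num_classes)
        if problems:
            report[str(path)] = problems
    return report
```

For the dataset described here, `check_label_dir(some_label_directory, 4)` would return an empty dictionary if every label line is well formed.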

cfg: yolov4-em-bf-vis.txt

hellboy5 commented 4 years ago

Were you able to resolve your issue? I was wondering the same thing. Also, is it possible to keep multiple yolo heads and use the contrastive loss on just one of the inputs to the yolo layers?

jylink commented 4 years ago

@hellboy5 nope, still no idea :/

pauliustumas commented 4 years ago

This is a guess, but a very similar error occurred with a very early version of YOLOv4 (the NaN issue). What @AlexeyAB did then was reduce the learning rate by a factor of two. I tried the same, and so far training is still running. You could do the same: set learning_rate=0.00131 in yolov4-tiny_contrastive.cfg
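The learning-rate change suggested above can be made by hand in the cfg file, or scripted as in the following Python sketch. The function name is hypothetical, and the starting value 0.00261 in the example is an assumption chosen only because halving it gives roughly the 0.00131 quoted above. Check the actual value in your own cfg.

```python
import re

# Matches an uncommented "learning_rate=<number>" line in a darknet cfg.
LR_RE = re.compile(
    r"^(\s*learning_rate\s*=\s*)([0-9.eE+-]+)(.*)$", re.MULTILINE
)


def scale_learning_rate(cfg_text, factor=0.5):
    """Return cfg_text with the first learning_rate entry multiplied by factor."""
    def repl(m):
        new_value = float(m.group(2)) * factor
        return f"{m.group(1)}{new_value:.6g}{m.group(3)}"

    new_text, count = LR_RE.subn(repl, cfg_text, count=1)
    if count == 0:
        raise ValueError("no learning_rate entry found")
    return new_text


if __name__ == "__main__":
    # Assumed example cfg fragment, not copied from the issue's cfg.
    cfg = "[net]\nbatch=64\nlearning_rate=0.00261\nburn_in=1000\n"
    print(scale_learning_rate(cfg))
```

Writing the result to a new cfg file, rather than overwriting the original, keeps both settings available for comparison.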

pauliustumas commented 4 years ago

Reducing the learning rate by a factor of two helped, at least in my case.

jylink commented 4 years ago

@pauliustumas Hi, did you decrease the learning rate in yolov4_contrastive or in yolov4_tiny_contrastive? I ask because I get those random errors only in yolov4_contrastive, while yolov4_tiny_contrastive works fine.

pauliustumas commented 4 years ago

The error was happening with yolov4_tiny_contrastive on a custom dataset, but I haven't tested with yolov4_contrastive yet.

pauliustumas commented 3 years ago

Yes, in your case the issue is definitely not the learning rate. I tried to run your version, and the issue happens as you described.

hellboy5 commented 3 years ago

What does classes mean in the contrastive section? In the last yolo layer classes is 80, but in the last contrastive layer classes is 1.

pauliustumas commented 3 years ago

I guess it was an error; in my case I changed it to one.