Can not train on custom data, obtaining segmentation fault

mixtoism commented 3 years ago

I am training on an AWS ml.p3.2xlarge instance using the SageMaker SDK. When I run training, the process fails with a segmentation fault after reading a few images.

I compiled darknet in a Docker container based on the nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04 image (CUDA 10.0, cuDNN 7) with the following flags: GPU=1 CUDNN=1 OPENCV=1 USE_CPP=1

When I run darknet/darknet detector train /cfg/obj-custom.data /cfg/yolov4_tiny-custom.cfg /opt/ml/input/data/conf/yolov4-tiny.conv.29 -dont_show I get the following output:

CUDA-version: 10000 (10020), cuDNN: 7.6.5, GPU count: 1  
 DEBUG=1 
 OpenCV version: 4.9.1d
 0 : compute_capability = 700, cudnn_half = 0, GPU: Tesla V100-SXM2-16GB 
   layer   filters  size/strd(dil)      input                output
   0 conv     32       3 x 3/ 2    416 x 416 x   3 ->  208 x 208 x  32 0.075 BF
   1 conv     64       3 x 3/ 2    208 x 208 x  32 ->  104 x 104 x  64 0.399 BF
   2 conv     64       3 x 3/ 1    104 x 104 x  64 ->  104 x 104 x  64 0.797 BF
   3 route  2                        1/2 ->  104 x 104 x  32
   4 conv     32       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  32 0.199 BF
   5 conv     32       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  32 0.199 BF
   6 route  5 4                            ->  104 x 104 x  64
   7 conv     64       1 x 1/ 1    104 x 104 x  64 ->  104 x 104 x  64 0.089 BF
   8 route  2 7                            ->  104 x 104 x 128
   9 max                2x 2/ 2    104 x 104 x 128 ->   52 x  52 x 128 0.001 BF
  10 conv    128       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 128 0.797 BF
  11 route  10                        1/2 ->   52 x  52 x  64
  12 conv     64       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x  64 0.199 BF
  13 conv     64       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x  64 0.199 BF
  14 route  13 12                            ->   52 x  52 x 128
  15 conv    128       1 x 1/ 1     52 x  52 x 128 ->   52 x  52 x 128 0.089 BF
  16 route  10 15                            ->   52 x  52 x 256
  17 max                2x 2/ 2     52 x  52 x 256 ->   26 x  26 x 256 0.001 BF
  18 conv    256       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 256 0.797 BF
  19 route  18                        1/2 ->   26 x  26 x 128
  20 conv    128       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 128 0.199 BF
  21 conv    128       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 128 0.199 BF
  22 route  21 20                            ->   26 x  26 x 256
  23 conv    256       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x 256 0.089 BF
  24 route  18 23                            ->   26 x  26 x 512
  25 max                2x 2/ 2     26 x  26 x 512 ->   13 x  13 x 512 0.000 BF
  26 conv    512       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.797 BF
  27 conv    256       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x 256 0.044 BF
  28 conv    512       3 x 3/ 1     13 x  13 x 256 ->   13 x  13 x 512 0.399 BF
  29 conv     24       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x  24 0.004 BF
  30 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.05
  31 route  27                            ->   13 x  13 x 256
  32 conv    128       1 x 1/ 1     13 x  13 x 256 ->   13 x  13 x 128 0.011 BF
  33 upsample                 2x    13 x  13 x 128 ->   26 x  26 x 128
  34 route  33 23                            ->   26 x  26 x 384
  35 conv    256       3 x 3/ 1     26 x  26 x 384 ->   26 x  26 x 256 1.196 BF
  36 conv     24       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x  24 0.008 BF
  37 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.05
Total BFLOPS 6.790 
avg_outputs = 299930 
 Allocate additional workspace_size = 26.22 MB 
yolov4_tiny-custom
net.optimized_memory = 0 
mini_batch = 1, batch = 16, time_steps = 1, train = 1 
nms_kind: greedynms (1), beta = 0.600000 
nms_kind: greedynms (1), beta = 0.600000 
Loading weights from /opt/ml/input/data/conf/yolov4-tiny.conv.29...Done! Loaded 29 layers from weights-file 
 Create 6 permanent cpu-threads 

 seen 64, trained: 0 K-images (0 Kilo-batches_64) 
Cannot load image /opt/ml/input/data/data/3be93979615fded0.jpg
Learning Rate: 0.00261, Momentum: 0.9, Decay: 0.0005
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 

 Error in load_data_detection() - OpenCV 
Cannot load image /opt/ml/input/data/data/72ac361867cddb83.jpg
Can't open label file. (This can be normal only if you use MSCOCO): /opt/ml/input/data/data/fa5d3633c56878cc.txt 
[...] // MORE LINES LIKE THE ONE ABOVE
Can't open label file. (This can be normal only if you use MSCOCO): /opt/ml/input/data/data/6b1d49f553dc3185.txt 
Loaded: 0.093033 seconds
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 30 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.475280, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 123.650963, iou_loss = 0.000000, total_loss = 123.650963 
[...] // MANY SIMILAR TO THIS ONE
 total_bbox = 2, rewritten_bbox = 0.000000 % 
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 30 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.477373, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 126.433594, iou_loss = 0.000000, total_loss = 126.433594 
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 37 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.509810, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 573.745911, iou_loss = 0.000000, total_loss = 573.745911 
 total_bbox = 3, rewritten_bbox = 0.000000 % 
/app/train.sh: line 15:    20 Segmentation fault      darknet/darknet detector train /cfg/obj-custom.data /cfg/yolov4_tiny-custom.cfg /opt/ml/input/data/conf/yolov4-tiny.conv.29 -dont_show

Any pointers toward a solution would be very much appreciated.

mixtoism commented 3 years ago

I compiled without GPU support and the problem persists, so my guess is that it depends on the data.
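The log above already names some files that failed to load ("Cannot load image" and "Can't open label file"). As a rough sketch, assuming the full training log has been saved as text, the paths can be pulled out with two regular expressions matched against those message formats. The function name `failing_paths` is made up for illustration and is not part of darknet.

```python
import re

# Patterns follow the two message formats visible in the training log above.
IMAGE_RE = re.compile(r"Cannot load image (\S+)")
LABEL_RE = re.compile(r"Can't open label file\..*?: (\S+)")


def failing_paths(log_text):
    """Return (images, labels): sorted unique paths reported as unreadable."""
    images = sorted(set(IMAGE_RE.findall(log_text)))
    labels = sorted(set(LABEL_RE.findall(log_text)))
    return images, labels


# Two lines copied from the log in this issue, used as sample input.
sample = (
    "Cannot load image /opt/ml/input/data/data/72ac361867cddb83.jpg\n"
    "Can't open label file. (This can be normal only if you use MSCOCO): "
    "/opt/ml/input/data/data/fa5d3633c56878cc.txt \n"
)
images, labels = failing_paths(sample)
print("images:", images)
print("labels:", labels)
```

The files listed this way are a starting point only; the image that triggers the segfault may not be reported before the crash.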

stephanecharette commented 3 years ago

You likely have a bad image or bad annotations. Make sure every image in the training list is valid, and that the annotations are valid too. I ran into this in the past: one of the images for some reason didn't transfer correctly and ended up as a 0-byte file. Darknet would segfault when it got to that image.
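A minimal sketch of such a check, using only the Python standard library. It assumes the training list holds one image path per line, that each label file sits next to its image with a `.txt` extension (as the paths in the log suggest), and that each annotation line is `class x_center y_center width height` with coordinates normalized to the image size. The helper names and the `num_classes` argument are illustrative, not part of darknet. It does not decode the images, so a truncated but non-empty file would still need a separate check with an image library.

```python
import os


def check_label_line(line, num_classes):
    """Return an error string for one annotation line, or None if it looks valid."""
    parts = line.split()
    if len(parts) != 5:
        return "expected 5 fields, got %d" % len(parts)
    try:
        class_id = int(parts[0])
        x, y, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return "non-numeric field"
    if not 0 <= class_id < num_classes:
        return "class id %d out of range" % class_id
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        return "center outside [0, 1]"
    if not (0.0 < w <= 1.0 and 0.0 < h <= 1.0):
        return "width/height outside (0, 1]"
    return None


def check_dataset(list_path, num_classes):
    """Return a list of (path, problem) tuples for a training list file."""
    problems = []
    with open(list_path) as f:
        image_paths = [ln.strip() for ln in f if ln.strip()]
    for image_path in image_paths:
        if not os.path.isfile(image_path):
            problems.append((image_path, "image missing"))
            continue
        if os.path.getsize(image_path) == 0:
            problems.append((image_path, "image is 0 bytes"))
        # Assumption: the label file is the image path with a .txt extension.
        label_path = os.path.splitext(image_path)[0] + ".txt"
        if not os.path.isfile(label_path):
            problems.append((label_path, "label file missing"))
            continue
        with open(label_path) as f:
            for n, line in enumerate(f, 1):
                if not line.strip():
                    continue  # blank lines are skipped, not flagged
                err = check_label_line(line, num_classes)
                if err:
                    problems.append((label_path, "line %d: %s" % (n, err)))
    return problems
```

Calling `check_dataset` with the path of the training list referenced by the `.data` file and the number of classes in the cfg returns every file that fails one of these checks; an empty list means nothing was flagged.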

mixtoism commented 3 years ago

I'm trying that right now. I will keep you posted.

shuang1204 commented 3 years ago

I ran into the same situation. Did you manage to solve this problem? @mixtoism

AlexeyAB / darknet

Can not train on custom data, obtaining segmentation fault #6983