Yolo Tiny training fails: CUDA Error: invalid argument: File exists

GiusBen commented 4 years ago

Hi, I'm not really sure this is an issue; most likely it's me, but I've spent this afternoon looking for a way around it to no avail. Basically, I'm trying to start a yolov3-tiny training run following the instructions here: https://github.com/AlexeyAB/darknet#how-to-train-tiny-yolo-to-detect-your-custom-objects . Any idea why this happens? Thanks in advance.

OS: Ubuntu 18.04, CUDA: 10.1, GPU: GTX 1080 Ti

Makefile flags:

GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=1
AVX=1
OPENMP=0
LIBSO=0
ZED_CAMERA=0 # ZED SDK 3.0 and above
ZED_CAMERA_v2_8=0 # ZED SDK 2.X

My .cfg: yolov3-tiny_obj.cfg.txt

How I'm starting the training: from a bash script containing only this command:

./darknet detector train data/obj.data ../data/v3-tiny/yolov3-tiny_obj.cfg ../data/v3-tiny/yolov3-tiny.conv.15 -dont_show -mjpeg_port 6007 -map

What I get:

CUDA-version: 10010 (11000), cuDNN: 7.6.5, GPU count: 1
 OpenCV version: 3.2.0
 Prepare additional network for mAP calculation...
 0 : compute_capability = 610, cudnn_half = 0, GPU: GeForce GTX 1080 Ti
net.optimized_memory = 0
mini_batch = 1, batch = 16, time_steps = 1, train = 0
   layer   filters  size/strd(dil)      input                output
   0 conv     16       3 x 3/ 1    416 x 416 x   3 ->  416 x 416 x  16 0.150 BF
   1 max                2x 2/ 2    416 x 416 x  16 ->  208 x 208 x  16 0.003 BF
   2 conv     32       3 x 3/ 1    208 x 208 x  16 ->  208 x 208 x  32 0.399 BF
   3 max                2x 2/ 2    208 x 208 x  32 ->  104 x 104 x  32 0.001 BF
   4 conv     64       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  64 0.399 BF
   5 max                2x 2/ 2    104 x 104 x  64 ->   52 x  52 x  64 0.001 BF
   6 conv    128       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x 128 0.399 BF
   7 max                2x 2/ 2     52 x  52 x 128 ->   26 x  26 x 128 0.000 BF
   8 conv    256       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 256 0.399 BF
   9 max                2x 2/ 2     26 x  26 x 256 ->   13 x  13 x 256 0.000 BF
  10 conv    512       3 x 3/ 1     13 x  13 x 256 ->   13 x  13 x 512 0.399 BF
  11 max                2x 2/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.000 BF
  12 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  13 conv    256       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 256 0.089 BF
  14 conv    512       3 x 3/ 1     13 x  13 x 256 ->   13 x  13 x 512 0.399 BF
  15 conv     21       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x  21 0.004 BF
  16 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
  17 route  13                                     ->   13 x  13 x 256
  18 conv    128       1 x 1/ 1     13 x  13 x 256 ->   13 x  13 x 128 0.011 BF
  19 upsample                 2x    13 x  13 x 128 ->   26 x  26 x 128
  20 route  19 8                                   ->   26 x  26 x 384
  21 conv    256       3 x 3/ 1     26 x  26 x 384 ->   26 x  26 x 256 1.196 BF
  22 conv     21       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x  21 0.007 BF
  23 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 5.449
avg_outputs = 325057
 Allocate additional workspace_size = 52.43 MB
yolov3-tiny_obj
 0 : compute_capability = 610, cudnn_half = 0, GPU: GeForce GTX 1080 Ti
net.optimized_memory = 0
mini_batch = 4, batch = 64, time_steps = 1, train = 1
   layer   filters  size/strd(dil)      input                output
   0 conv     16       3 x 3/ 1    416 x 416 x   3 ->  416 x 416 x  16 0.150 BF
   1 max                2x 2/ 2    416 x 416 x  16 ->  208 x 208 x  16 0.003 BF
   2 conv     32       3 x 3/ 1    208 x 208 x  16 ->  208 x 208 x  32 0.399 BF
   3 max                2x 2/ 2    208 x 208 x  32 ->  104 x 104 x  32 0.001 BF
   4 conv     64       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  64 0.399 BF
   5 max                2x 2/ 2    104 x 104 x  64 ->   52 x  52 x  64 0.001 BF
   6 conv    128       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x 128 0.399 BF
   7 max                2x 2/ 2     52 x  52 x 128 ->   26 x  26 x 128 0.000 BF
   8 conv    256       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 256 0.399 BF
   9 max                2x 2/ 2     26 x  26 x 256 ->   13 x  13 x 256 0.000 BF
  10 conv    512       3 x 3/ 1     13 x  13 x 256 ->   13 x  13 x 512 0.399 BF
  11 max                2x 2/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.000 BF
  12 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  13 conv    256       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 256 0.089 BF
  14 conv    512       3 x 3/ 1     13 x  13 x 256 ->   13 x  13 x 512 0.399 BF
  15 conv     21       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x  21 0.004 BF
  16 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
  17 route  13                                     ->   13 x  13 x 256
  18 conv    128       1 x 1/ 1     13 x  13 x 256 ->   13 x  13 x 128 0.011 BF
  19 upsample                 2x    13 x  13 x 128 ->   26 x  26 x 128
  20 route  19 8                                   ->   26 x  26 x 384
  21 conv    256       3 x 3/ 1     26 x  26 x 384 ->   26 x  26 x 256 1.196 BF
  22 conv     21       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x  21 0.007 BF
  23 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 5.449
avg_outputs = 325057
 Allocate additional workspace_size = 52.43 MB
Loading weights from ../data/v3-tiny/yolov3-tiny.conv.15...
 seen 64, trained: 0 K-images (0 Kilo-batches_64)
Done! Loaded 15 layers from weights-file
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing, random_coef = 1.40

 608 x 608
 Create 6 permanent cpu-threads
 try to allocate additional workspace_size = 52.43 MB
 CUDA allocate done!
Loaded: 0.000043 seconds
CUDA status Error: file: ./src/dark_cuda.c : () : line: 477 : build time: Jun 24 2020 - 17:15:57

 CUDA Error: invalid argument
CUDA Error: invalid argument: File exists
darknet: ./src/utils.c:326: error: Assertion `0' failed.
./start-training.sh: line 7: 27484 Aborted                 (core dumped) ./darknet detector train data/obj.data ../data/v3-tiny/yolov3-tiny_obj.cfg ../data/v3-tiny/yolov3-tiny.conv.15 -dont_show -mjpeg_port 6007 -map

I see the block

0 : compute_capability = 610, cudnn_half = 0, GPU: GeForce GTX 1080 Ti
...
Allocate additional workspace_size = 52.43 MB

output twice. Is that expected?

Also, the function cuda_pull_array in dark_cuda.c gets called twice, with variable status (line 476) getting assigned 0 the first time and 1 the second (maybe this is what triggers the early exit?).
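
A note on reading the message itself: in the CUDA runtime, status 0 is cudaSuccess and status 1 is cudaErrorInvalidValue, whose error string is "invalid argument". The trailing ": File exists" has the shape of what perror() appends, namely the text for whatever value errno currently holds, and "File exists" is the text for EEXIST. The sketch below reproduces that shape in plain C without CUDA. It assumes that errno was left at EEXIST by some earlier, unrelated call; the helper names are hypothetical and are not darknet functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for perror(): writes "<msg>: <strerror(errno)>"
   into out instead of printing to stderr, so the result can be inspected. */
void format_like_perror(const char *msg, char *out, size_t out_len)
{
    assert(msg != NULL && out != NULL && out_len > 0);
    snprintf(out, out_len, "%s: %s", msg, strerror(errno));
}

/* Rebuilds the shape of the log line. Returns 1 if the stale errno text
   ended up in the message, 0 otherwise. */
int stale_errno_demo(char *out, size_t out_len)
{
    /* Assumption: an earlier, unrelated call failed and left errno at EEXIST. */
    errno = EEXIST;

    /* A CUDA runtime call reports failure through its return value and is
       not expected to update errno, so the old value is still in place. */
    format_like_perror("CUDA Error: invalid argument", out, out_len);

    return strstr(out, "File exists") != NULL;
}
```

If that reading is right, only the "invalid argument" part describes the CUDA failure, and "File exists" is probably unrelated to it.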

imaami commented 4 years ago

Do you happen to have the core dump file? If not, can you enable core dumps, then run again, and upload the core dump and the executable?
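
For reference, core dumps are usually enabled from the shell with ulimit -c unlimited before starting the program. The same effect can be had from inside a process; the following is only a minimal sketch, assuming a POSIX system. The helper enable_core_dumps is a hypothetical name and not part of darknet. It raises the soft core-file limit to the hard limit, which is the most an unprivileged process is allowed to do.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper, not part of darknet: raise the soft core-file size
   limit to the hard limit. Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* The soft limit may be raised up to, but not beyond, the hard limit. */
    lim.rlim_cur = lim.rlim_max;

    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Where the dump file ends up still depends on system settings outside the process, so this only covers the size limit.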

GiusBen commented 4 years ago

Yes, I've uploaded both the dump and the executable here, along with the Makefile (apart from the aforementioned flags, the only other thing I've set is ARCH at line 36). OS is Ubuntu 18.04 x86_64, kernel 4.15.0-55-generic.

UPDATE:

The same problem showed up with yolov4-tiny. It turned out that max=50 at the end of the .cfg file was the culprit (I don't know why I set it to 50 when the instructions said to set it to 200 or more).

imaami commented 4 years ago

@GiusBen OK, good to hear you found a fix. It's nevertheless a bug in darknet if it crashes mysteriously simply because some parameter in a config file is incorrect.

I had a quick look at the dump already, but because I'm running Debian Sid I couldn't install all the dependency libraries right away. I'll create an Ubuntu 18.04 chroot to get all the libraries and debugging symbols available; maybe the backtrace will reveal some low-hanging fruit to fix.
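
To illustrate the point about failing more clearly on a questionable config value, here is a rough sketch of an up-front check on a single cfg line. It is not darknet code: the helper max_is_suspicious and the constant README_MIN_MAX are hypothetical, and the threshold of 200 is simply the figure quoted in this thread, not a value taken from the darknet sources.

```c
#include <stdio.h>

/* Threshold quoted in this thread; an assumption, not a darknet constant. */
#define README_MIN_MAX 200

/* Hypothetical helper, not part of darknet.
   Returns 1 if the line is a "max=<n>" entry with n below the threshold,
   0 if the line is not a max entry or the value looks fine.
   When a value is parsed it is stored in *value_out. */
int max_is_suspicious(const char *line, int *value_out)
{
    int value = 0;

    /* Matches "max=50" and "max = 50"; "max_batches = 6000" does not match
       because '=' must directly follow "max" (whitespace aside). */
    if (sscanf(line, " max = %d", &value) != 1) {
        return 0;
    }

    *value_out = value;
    return value < README_MIN_MAX;
}
```

A parser could call a check like this on every line of a [yolo] section and print a warning naming the key and value, rather than letting the run proceed to an unrelated-looking CUDA error.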

AlexeyAB / darknet

Yolo Tiny training fails: CUDA Error: invalid argument: File exists #6066
