Not able to resume from saved weights nor yolo-tiny

NotsOverflow commented 5 years ago

The Problem

I am training a custom model based on yolov3-tiny.cfg.
The data has 1 class; the configuration and the VOC data set have been adjusted accordingly.
Training cannot be resumed from the saved weights, nor started from the pretrained yolov3-tiny weights.
Yet it can start from darknet53.conv.74.
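
For readers checking a similar one-class setup: the AlexeyAB/darknet README gives filters=(classes + 5)x3 for the convolutional layer directly before each [yolo] layer. The sketch below only reproduces that arithmetic; the function name and the default of 3 anchors per [yolo] layer are illustrative assumptions, not part of darknet. With classes=1 it yields 18, which matches layers 15 and 22 in the layer listing further down.

```python
def yolo_conv_filters(classes: int, anchors_per_layer: int = 3) -> int:
    """Filters for the conv layer feeding a [yolo] layer.

    Each anchor predicts 4 box coordinates, 1 objectness score and one
    score per class. The helper name is hypothetical, for illustration only.
    """
    return (classes + 5) * anchors_per_layer


if __name__ == "__main__":
    # One class, as in this issue
    print(yolo_conv_filters(1))
```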

Some Info

Darknet was compiled with the following options:

GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=1
AVX=1
OPENMP=0
LIBSO=0
ZED_CAMERA=0
DEBUG=1

ARCH= -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52] \
      -gencode arch=compute_61,code=[sm_61,compute_61]

Last output from training:

 9999: 1.758102, 1.635159 avg loss, 0.000010 rate, 5.400419 seconds, 639936 images
Loaded: 0.000042 seconds
Region 16 Avg IOU: 0.702927, Class: 0.998371, Obj: 0.286974, No Obj: 0.002685, .5R: 0.900000, .75R: 0.300000,  count: 10
Region 23 Avg IOU: 0.706249, Class: 0.998734, Obj: 0.142701, No Obj: 0.000121, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 16 Avg IOU: 0.814198, Class: 0.999443, Obj: 0.339490, No Obj: 0.001901, .5R: 1.000000, .75R: 0.833333,  count: 6
Region 23 Avg IOU: 0.511209, Class: 0.998074, Obj: 0.056416, No Obj: 0.000163, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 16 Avg IOU: 0.761932, Class: 0.999598, Obj: 0.140708, No Obj: 0.001993, .5R: 1.000000, .75R: 0.600000,  count: 5
Region 23 Avg IOU: 0.725618, Class: 0.997905, Obj: 0.181799, No Obj: 0.000231, .5R: 1.000000, .75R: 0.250000,  count: 4
Region 16 Avg IOU: 0.707494, Class: 0.999397, Obj: 0.043593, No Obj: 0.001296, .5R: 1.000000, .75R: 0.333333,  count: 6
Region 23 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000060, .5R: -nan, .75R: -nan,  count: 0
Region 16 Avg IOU: 0.732325, Class: 0.992690, Obj: 0.086387, No Obj: 0.001262, .5R: 1.000000, .75R: 0.666667,  count: 3
Region 23 Avg IOU: 0.601230, Class: 0.999520, Obj: 0.004902, No Obj: 0.000089, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 16 Avg IOU: 0.688970, Class: 0.999634, Obj: 0.275698, No Obj: 0.002817, .5R: 1.000000, .75R: 0.000000,  count: 5
Region 23 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000131, .5R: -nan, .75R: -nan,  count: 0
Region 16 Avg IOU: 0.751253, Class: 0.998739, Obj: 0.181112, No Obj: 0.002876, .5R: 1.000000, .75R: 0.555556,  count: 9
Region 23 Avg IOU: 0.795128, Class: 0.999631, Obj: 0.164175, No Obj: 0.000085, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 16 Avg IOU: 0.739599, Class: 0.999520, Obj: 0.396461, No Obj: 0.002951, .5R: 0.857143, .75R: 0.857143,  count: 7
Region 23 Avg IOU: 0.720030, Class: 0.999544, Obj: 0.120141, No Obj: 0.000357, .5R: 1.000000, .75R: 0.400000,  count: 5
Region 16 Avg IOU: 0.686548, Class: 0.999488, Obj: 0.099333, No Obj: 0.001890, .5R: 1.000000, .75R: 0.400000,  count: 5
Region 23 Avg IOU: 0.633770, Class: 0.999999, Obj: 0.619372, No Obj: 0.000184, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 16 Avg IOU: 0.716642, Class: 0.999353, Obj: 0.269416, No Obj: 0.001971, .5R: 1.000000, .75R: 0.250000,  count: 4
Region 23 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000042, .5R: -nan, .75R: -nan,  count: 0
Region 16 Avg IOU: 0.768687, Class: 0.999829, Obj: 0.191831, No Obj: 0.003386, .5R: 1.000000, .75R: 0.545455,  count: 11
Region 23 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000026, .5R: -nan, .75R: -nan,  count: 0
Region 16 Avg IOU: 0.702018, Class: 0.998927, Obj: 0.205704, No Obj: 0.002598, .5R: 0.900000, .75R: 0.400000,  count: 10
Region 23 Avg IOU: 0.292048, Class: 0.999282, Obj: 0.005860, No Obj: 0.000081, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 16 Avg IOU: 0.721211, Class: 0.999873, Obj: 0.097425, No Obj: 0.001717, .5R: 1.000000, .75R: 0.400000,  count: 5
Region 23 Avg IOU: 0.501038, Class: 0.999394, Obj: 0.316764, No Obj: 0.000629, .5R: 0.600000, .75R: 0.000000,  count: 5
Region 16 Avg IOU: 0.724767, Class: 0.999281, Obj: 0.392172, No Obj: 0.003209, .5R: 1.000000, .75R: 0.375000,  count: 8
Region 23 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000077, .5R: -nan, .75R: -nan,  count: 0
Region 16 Avg IOU: 0.817750, Class: 0.999545, Obj: 0.212047, No Obj: 0.002177, .5R: 1.000000, .75R: 0.750000,  count: 4
Region 23 Avg IOU: 0.718841, Class: 0.999646, Obj: 0.220507, No Obj: 0.000213, .5R: 1.000000, .75R: 0.333333,  count: 3
Region 16 Avg IOU: 0.773540, Class: 0.999379, Obj: 0.111581, No Obj: 0.002086, .5R: 1.000000, .75R: 0.666667,  count: 3
Region 23 Avg IOU: 0.631157, Class: 0.999853, Obj: 0.059767, No Obj: 0.000068, .5R: 1.000000, .75R: 0.000000,  count: 1

 10000: 1.444123, 1.616055 avg loss, 0.000010 rate, 5.393773 seconds, 640000 images
Saving weights to /home/user/Documents/alt_darknet/backup/f-tiny3_10000.weights
Saving weights to /home/user/Documents/alt_darknet/backup/f-tiny3_last.weights
Saving weights to /home/user/Documents/alt_darknet/backup/f-tiny3_final.weights

Output from trying to resume

$> optirun ./darknet detector train  ffolder/fvoc.data ffolder/f-tiny3.cfg f-tiny3_final.weights
f-tiny3
layer     filters    size              input                output
   0 conv     16  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  16 0.150 BF
   1 max          2 x 2 / 2   416 x 416 x  16   ->   208 x 208 x  16 0.003 BF
   2 conv     32  3 x 3 / 1   208 x 208 x  16   ->   208 x 208 x  32 0.399 BF
   3 max          2 x 2 / 2   208 x 208 x  32   ->   104 x 104 x  32 0.001 BF
   4 conv     64  3 x 3 / 1   104 x 104 x  32   ->   104 x 104 x  64 0.399 BF
   5 max          2 x 2 / 2   104 x 104 x  64   ->    52 x  52 x  64 0.001 BF
   6 conv    128  3 x 3 / 1    52 x  52 x  64   ->    52 x  52 x 128 0.399 BF
   7 max          2 x 2 / 2    52 x  52 x 128   ->    26 x  26 x 128 0.000 BF
   8 conv    256  3 x 3 / 1    26 x  26 x 128   ->    26 x  26 x 256 0.399 BF
   9 max          2 x 2 / 2    26 x  26 x 256   ->    13 x  13 x 256 0.000 BF
  10 conv    512  3 x 3 / 1    13 x  13 x 256   ->    13 x  13 x 512 0.399 BF
  11 max          2 x 2 / 1    13 x  13 x 512   ->    13 x  13 x 512 0.000 BF
  12 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024 1.595 BF
  13 conv    256  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 256 0.089 BF
  14 conv    512  3 x 3 / 1    13 x  13 x 256   ->    13 x  13 x 512 0.399 BF
  15 conv     18  1 x 1 / 1    13 x  13 x 512   ->    13 x  13 x  18 0.003 BF
  16 yolo
  17 route  13
  18 conv    128  1 x 1 / 1    13 x  13 x 256   ->    13 x  13 x 128 0.011 BF
  19 upsample            2x    13 x  13 x 128   ->    26 x  26 x 128
  20 route  19 8
  21 conv    256  3 x 3 / 1    26 x  26 x 384   ->    26 x  26 x 256 1.196 BF
  22 conv     18  1 x 1 / 1    26 x  26 x 256   ->    26 x  26 x  18 0.006 BF
  23 yolo
Total BFLOPS 5.448 
 Allocate additional workspace_size = 52.43 MB 
Loading weights from f-tiny3_final.weights...
 seen 64 
Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
 If error occurs - run training with flag: -dont_show 
Saving weights to /home/user/Documents/alt_darknet/backup/f-tiny3_final.weights

AlexeyAB commented 5 years ago

$> optirun ./darknet detector train  ffolder/fvoc.data ffolder/f-tiny3.cfg f-tiny3_final.weights
f-tiny3
layer     filters    size              input                output
   0 conv     16  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  16 0.150 BF
....
Loading weights from f-tiny3_final.weights...
 seen 64 
Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
 If error occurs - run training with flag: -dont_show 
Saving weights to /home/user/Documents/alt_darknet/backup/f-tiny3_final.weights

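> It's not able to resume training from saved one or the yolov3-tiny pretrained one.

The saved weights have already reached the number of iterations set by max_batches, so training stops immediately and only saves the final weights again. You must increase max_batches= in your cfg file, or use the -clear flag to reset the iteration counter.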
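
A minimal sketch of why the run ends at once, assuming darknet derives the current iteration from the image counter stored in the weights file divided by the batch size, and keeps training only while that iteration is below max_batches. The function names are hypothetical; batch=64 is inferred from the log (640000 images at iteration 10000), and max_batches=10000 is an assumption, since the cfg file is not shown in this thread.

```python
def current_iteration(images_seen: int, batch: int) -> int:
    """Iteration number implied by the image counter stored in the weights."""
    return images_seen // batch


def will_train(images_seen: int, batch: int, max_batches: int,
               clear: bool = False) -> bool:
    """Return True if at least one more training iteration would run.

    clear=True mimics the effect of the -clear flag, assumed here to reset
    the image counter to zero before training starts.
    """
    if clear:
        images_seen = 0
    return current_iteration(images_seen, batch) < max_batches


if __name__ == "__main__":
    seen, batch = 640000, 64  # values taken from the training log
    print(will_train(seen, batch, max_batches=10000))              # False
    print(will_train(seen, batch, max_batches=20000))              # True
    print(will_train(seen, batch, max_batches=10000, clear=True))  # True
```

Under these assumptions, raising max_batches continues training from iteration 10000, while -clear restarts the count from zero and keeps the learned weights.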

> It's not able to resume training from saved one or the yolov3-tiny pretrained one.

It should be able to use yolov3-tiny.conv.15 pre-trained weights without any changes: https://github.com/AlexeyAB/darknet#how-to-train-tiny-yolo-to-detect-your-custom-objects

NotsOverflow commented 5 years ago

Worked like a charm, Thanks @AlexeyAB :)

AlexeyAB / darknet

Not able to resume from saved weights nor yolo-tiny #2989
