AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

nan avg loss during training #5836

Closed netqyq closed 3 years ago

netqyq commented 4 years ago

If you have an issue with training - no-detections / Nan avg-loss / low accuracy:

yolov4-rubbish.cfg.txt

 130: 466.184570, 642.938660 avg loss, 0.000000 rate, 1.968303 seconds, 8320 images, 19.746453 hours left
 131: 773.097473, 655.954529 avg loss, 0.000000 rate, 3.330336 seconds, 8384 images, 19.657633 hours left
 132: 766.986023, 667.057678 avg loss, 0.000000 rate, 3.306173 seconds, 8448 images, 19.644869 hours left
 133: 768.746643, 677.226562 avg loss, 0.000000 rate, 3.381084 seconds, 8512 images, 19.630889 hours left
 134: 740.305725, 683.534485 avg loss, 0.000000 rate, 3.281588 seconds, 8576 images, 19.621175 hours left
 135: 733.656799, 688.546692 avg loss, 0.000000 rate, 3.369697 seconds, 8640 images, 19.606058 hours left
 136: 734.087097, 693.100708 avg loss, 0.000000 rate, 3.336241 seconds, 8704 images, 19.595945 hours left
 137: 714.507507, 695.241394 avg loss, 0.000000 rate, 3.294343 seconds, 8768 images, 19.584077 hours left
 138: 700.614136, 695.778687 avg loss, 0.000000 rate, 3.353261 seconds, 8832 images, 19.570008 hours left
 139: 694.423218, 695.643127 avg loss, 0.000000 rate, 3.343041 seconds, 8896 images, 19.559320 hours left
 140: 652.778015, 691.356628 avg loss, 0.000000 rate, 3.272530 seconds, 8960 images, 19.548166 hours left
 141: 659.008179, 688.121765 avg loss, 0.000000 rate, 46.281411 seconds, 9024 images, 19.533224 hours left
 142: -nan, -nan avg loss, 0.000000 rate, 46.714420 seconds, 9088 images, 21.909830 hours left
 143: -nan, -nan avg loss, 0.000000 rate, 44.583545 seconds, 9152 images, 24.267557 hours left
 144: -nan, -nan avg loss, 0.000000 rate, 48.660171 seconds, 9216 images, 26.484042 hours left
 145: -nan, -nan avg loss, 0.000000 rate, 44.227352 seconds, 9280 images, 28.903087 hours left
 146: -nan, -nan avg loss, 0.000000 rate, 43.479576 seconds, 9344 images, 31.053324 hours left
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 52, class_loss = 24.500000, iou_loss = 0.000153, total_loss = 24.500153
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.000000, GIOU: -0.053107), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 4, class_loss = -nan, iou_loss = -nan, total_loss = -nan
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.000000, GIOU: -0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 14, class_loss = 7.000000, iou_loss = 0.000011, total_loss = 7.000011
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.000000, GIOU: -0.000353), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 36, class_loss = 17.999998, iou_loss = 0.002825, total_loss= 18.002823
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.000000, GIOU: -0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 5, class_loss = 2.500000, iou_loss = 0.000004, total_loss =2.500004
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.000000, GIOU: -0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 13, class_loss = 6.500000, iou_loss = 0.000022, total_loss = 6.500022

v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 3, class_loss = 1.500000, iou_loss = 0.000005, total_loss = 1.500005

v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 8, class_loss = 4.000000, iou_loss = 0.003686, total_loss = 4.003686
Loaded: 0.000060 seconds

 116: -nan, -nan avg loss, 0.000000 rate, 40.276380 seconds, 7424 images, 233.787316 hours left
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.000000, GIOU: -0.000001), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 2, class_loss = 1.000000, iou_loss = 0.000002, total_loss =1.000002
netqyq commented 4 years ago

If you have an issue with training - no-detections / Nan avg-loss / low accuracy:

  • what command do you use?

./darknet detector train data/rubbish_data.data cfg/yolov4-rubbish.cfg /home/ma-user/work/yolov4.conv.137

  • what dataset do you use?

My custom dataset, which has 44 classes.

  • what Loss and mAP did you get?

avg loss: nan

  • show chart.png with Loss and mAP
  • check your dataset - run training with flag -show_imgs i.e. ./darknet detector train ... -show_imgs and look at the aug_...jpg images, do you see correct truth bounding boxes?

No boxes there.
  • rename your cfg-file to txt-file and drag-n-drop (attach) to your message here

yolov4-rubbish.cfg.txt

./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/dog.jpg
CUDA-version: 9000 (10020), cuDNN: 7.4.1, CUDNN_HALF=1, GPU count: 1
 CUDNN_HALF=1
 OpenCV version: 4.9.1
 0 : compute_capability = 600, cudnn_half = 0, GPU: Tesla P100-PCIE-16GB
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
   layer   filters  size/strd(dil)      input                output
   0 conv     32       3 x 3/ 1    608 x 608 x   3 ->  608 x 608 x  32 0.639 BF
AlexeyAB commented 4 years ago
  • check your dataset - run training with flag -show_imgs i.e. ./darknet detector train ... -show_imgs and look at the aug_...jpg images, do you see correct truth bounding boxes? no boxes there

Your dataset is wrong.

Show content of several txt-label files.

Read: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

You should label each object in the images of your dataset. Use this visual GUI software for marking bounding boxes of objects and generating annotation files for Yolo v2 & v3: https://github.com/AlexeyAB/Yolo_mark

It will create a .txt file for each .jpg image file, in the same directory and with the same name but a .txt extension, and put into that file the object number and object coordinates on the image, one object per line:

<object-class> <x> <y> <width> <height>

Where:

  • <object-class> - integer object number from 0 to (classes-1)
  • <x> <y> <width> <height> - float values relative to the width and height of the image, each in (0.0, 1.0]
  • for example: <x> = <absolute_x> / <image_width> or <height> = <absolute_height> / <image_height>
  • attention: <x> <y> are the center of the rectangle (not the top-left corner)

For example, for img1.jpg an img1.txt will be created containing:

1 0.716797 0.395833 0.216406 0.147222
0 0.687109 0.379167 0.255469 0.158333
1 0.420312 0.395833 0.140625 0.166667
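As a sketch of that conversion (the function name is mine, not from the repo), turning an absolute VOC-style box into this relative center format looks like:

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert an absolute (xmin, ymin, xmax, ymax) box to YOLO's
    relative <x> <y> <width> <height>, where x/y are the box CENTER."""
    x = (xmin + xmax) / 2.0 / img_w
    y = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / float(img_w)
    h = (ymax - ymin) / float(img_h)
    return x, y, w, h

# The first <bndbox> from the 1080x1440 annotation posted later in this thread:
print(voc_to_yolo(302, 1, 877, 1440, 1080, 1440))
```

The printed values match the corresponding YOLO label line shown further down in the thread (class 35).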
netqyq commented 4 years ago

My annotation txt files are converted from VOC xml format; this is my convert script.

The dataset has 44 classes in total and around 15,000 images. Each image can have multiple labels, meaning one image may contain several objects.

After converting, I spot-checked them by eye with the labelImg tool.

Convert VOC

<annotation>
    <folder>1label</folder>
    <filename>2b09f04c1b078b57980c0ac9cc18c6b.jpg</filename>
    <path>C:\Users\hwx594248\Desktop\1label\2b09f04c1b078b57980c0ac9cc18c6b.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>1080</width>
        <height>1440</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>金属厨具</name><!-- metal kitchenware -->
        <pose>Unspecified</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>302</xmin>
            <ymin>1</ymin>
            <xmax>877</xmax>
            <ymax>1440</ymax>
        </bndbox>
    </object>
    <object>
        <name>砧板</name><!-- chopping board -->
        <pose>Unspecified</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>1</xmin>
            <ymin>1</ymin>
            <xmax>1025</xmax>
            <ymax>1440</ymax>
        </bndbox>
    </object>
</annotation>

to txt for YOLO

35 0.5458333333333333 0.5003472222222223 0.5324074074074074 0.9993055555555556
24 0.475 0.5003472222222223 0.9481481481481482 0.9993055555555556

image

another txt file

1 0.5005208333333333 0.5684895833333333 0.9989583333333333 0.8630208333333333
18 0.659375 0.140625 0.6354166666666666 0.20833333333333334
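Eyeballing labels can miss out-of-range values, which are a classic cause of nan loss. A small checker (my own sketch, not the author's convert script) can validate every generated line before training:

```python
def check_yolo_labels(lines, num_classes):
    """Return (line_number, reason) pairs for label lines darknet would choke on."""
    problems = []
    for i, line in enumerate(lines, 1):
        parts = line.split()
        if len(parts) != 5:
            problems.append((i, "expected 5 fields"))
            continue
        try:
            cls = int(parts[0])
            x, y, w, h = (float(p) for p in parts[1:])
        except ValueError:
            problems.append((i, "non-numeric field"))
            continue
        if not 0 <= cls < num_classes:
            problems.append((i, f"class id {cls} outside 0..{num_classes - 1}"))
        if not all(0.0 < v <= 1.0 for v in (x, y, w, h)):
            problems.append((i, "coordinate outside (0.0, 1.0]"))
    return problems

# The two label lines quoted above, with the thread's 44 classes:
labels = [
    "35 0.5458333333333333 0.5003472222222223 0.5324074074074074 0.9993055555555556",
    "24 0.475 0.5003472222222223 0.9481481481481482 0.9993055555555556",
]
print(check_yolo_labels(labels, num_classes=44))  # [] -> both lines are valid
```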
AlexeyAB commented 4 years ago

Do you need a classifier or a detector? Since there is just one object in the images.

netqyq commented 4 years ago

I need object detection on images, with bounding boxes drawn.

(screenshot: labelImg showing img_52.jpg from VOC2007/JPEGImages with ground-truth boxes, 2020-06-03)

AlexeyAB commented 4 years ago

Show a screenshot of the cloud of points: ./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 -show

netqyq commented 4 years ago

image

image

AlexeyAB commented 4 years ago

image

netqyq commented 4 years ago

Why without GPU? Because that image was captured on my MacBook; the parameters are the same as on my training server. My training runs on a server, which cannot execute the display window (-show).

This is the capture from my training server: image

I will try it: burn_in=5000

I am using the latest repo, downloaded around two days ago.

By the way, I have access to another server with 24 cores and 700+ GB of memory; how can I utilize it better?

netqyq commented 4 years ago

I restarted the training from the checkpoint saved at the last 100-iteration mark, with burn_in=5000, and again got nan.

 174: 933.341797, 780.749146 avg loss, 0.000000 rate, 3.172139 seconds, 11136 images, 17.991696 hours left
 175: 971.599854, 799.834229 avg loss, 0.000000 rate, 3.260993 seconds, 11200 images, 17.986480 hours left
 176: 941.787659, 814.029541 avg loss, 0.000000 rate, 3.154222 seconds, 11264 images, 17.986201 hours left
 177: 937.036682, 826.330261 avg loss, 0.000000 rate, 3.189103 seconds, 11328 images, 17.980036 hours left
 178: 965.046997, 840.201904 avg loss, 0.000000 rate, 3.262084 seconds, 11392 images, 17.975844 hours left
 179: 946.607117, 850.842407 avg loss, 0.000000 rate, 3.230607 seconds, 11456 images, 17.975704 hours left
 180: 960.578308, 861.815979 avg loss, 0.000000 rate, 3.254701 seconds, 11520 images, 17.973824 hours left
 181: 590.941345, 834.728516 avg loss, 0.000000 rate, 1.937414 seconds, 11584 images, 17.973280 hours left
 182: -nan, -nan avg loss, 0.000000 rate, 2.011170 seconds, 11648 images, 17.923213 hours left
 183: -nan, -nan avg loss, 0.000000 rate, 2.055895 seconds, 11712 images, 17.854699 hours left
 184: -nan, -nan avg loss, 0.000000 rate, 2.025939 seconds, 11776 images, 17.789327 hours left
 185: -nan, -nan avg loss, 0.000000 rate, 2.049610 seconds, 11840 images, 17.722955 hours left
 186: -nan, -nan avg loss, 0.000000 rate, 2.035549 seconds, 11904 images, 17.658543 hours left
 187: -nan, -nan avg loss, 0.000000 rate, 2.026518 seconds, 11968 images, 17.593997 hours left
netqyq commented 4 years ago

When I start from scratch with burn_in=5000, I get -nan at the 32nd iteration.

 27: 1094.624878, 1398.332397 avg loss, 0.000000 rate, 1.929200 seconds, 1728 images, 20.005706 hours left
 28: 1091.277344, 1367.626953 avg loss, 0.000000 rate, 1.947525 seconds, 1792 images, 19.912686 hours left
 29: 1096.390747, 1340.503296 avg loss, 0.000000 rate, 1.932057 seconds, 1856 images, 19.821607 hours left
 30: 1083.744751, 1314.827393 avg loss, 0.000000 rate, 1.946099 seconds, 1920 images, 19.730575 hours left
 31: 935.924011, 1276.937012 avg loss, 0.000000 rate, 1.675474 seconds, 1984 images, 19.641228 hours left
 32: -nan, -nan avg loss, 0.000000 rate, 1.728330 seconds, 2048 images, 19.554481 hours left
 33: -nan, -nan avg loss, 0.000000 rate, 1.661758 seconds, 2112 images, 19.454805 hours left

cfg

[net]
# Training
batch=64
subdivisions=16
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.0001
burn_in=5000
max_batches = 20000
policy=steps
steps=16000,18000
scales=.1,.1
...
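For reference, the "0.000000 rate" column in the logs above is expected early on: during burn-in, darknet scales the learning rate roughly as learning_rate * (iteration / burn_in)^power, with power defaulting to 4 (get_current_rate in src/network.c), so the effective rate is vanishingly small for the first few hundred iterations and prints as zero. A sketch of that schedule:

```python
def current_rate(iteration, learning_rate=0.0001, burn_in=5000, power=4):
    """Mimic darknet's warm-up: lr * (iter/burn_in)^power until burn_in ends."""
    if iteration < burn_in:
        return learning_rate * (iteration / burn_in) ** power
    return learning_rate

# With burn_in=5000, iteration 142 gives ~6.5e-11 -- printed as "0.000000 rate",
# so the nan here is not caused by an oversized learning-rate step.
print(current_rate(142))
```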
AlexeyAB commented 4 years ago

Can you reproduce Nan issue with small dataset, ~100 - 1000 training images, and share these images? I will try to train.

Also, you can try changing max_delta=5 to max_delta=1 for each [yolo] layer in the cfg-file.

netqyq commented 4 years ago

Again, nan appears. Today I picked 600 images from the dataset: 480 train images and 120 test images. Nan appears at the 484th iteration.

 472: 12.685884, 18.907265 avg loss, 0.000005 rate, 3.444814 seconds, 30208 images, 14.311615 hours left
 473: 14.853024, 18.501841 avg loss, 0.000005 rate, 3.451708 seconds, 30272 images, 14.355366 hours left
 474: 27.073414, 19.358997 avg loss, 0.000005 rate, 3.554246 seconds, 30336 images, 14.399043 hours left
 475: 16.928411, 19.115938 avg loss, 0.000005 rate, 3.516808 seconds, 30400 images, 14.447836 hours left
 476: 15.022050, 18.706549 avg loss, 0.000005 rate, 3.484360 seconds, 30464 images, 14.494101 hours left
 477: 19.145966, 18.750490 avg loss, 0.000005 rate, 3.516784 seconds, 30528 images, 14.538133 hours left
 478: 9.029431, 17.778385 avg loss, 0.000005 rate, 3.463961 seconds, 30592 images, 14.583474 hours left
 479: 17.181927, 17.718739 avg loss, 0.000005 rate, 3.523369 seconds, 30656 images, 14.625488 hours left
 480: 9.644224, 16.911287 avg loss, 0.000005 rate, 3.399598 seconds, 30720 images, 14.670293 hours left

 481: 16.324453, 16.852604 avg loss, 0.000005 rate, 56.586223 seconds, 30784 images, 14.707929 hours left
 482: -nan, -nan avg loss, 0.000005 rate, 48.838414 seconds, 30848 images, 17.643123 hours left
 483: -nan, -nan avg loss, 0.000005 rate, 64.476111 seconds, 30912 images, 20.114554 hours left
 484: -nan, -nan avg loss, 0.000005 rate, 45.124677 seconds, 30976 images, 23.408916 hours left

This is the whole rubbish dataset (VOC); you can use this subset.py to pick any number of images from it. It generates train.txt and test.txt, and you can set the maximum number of images to pick from each class, so the subset is sampled evenly across classes.
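subset.py itself is not attached here; a class-balanced picker along the lines described (the function name and capping logic are my guesses) might look like:

```python
import random
from collections import defaultdict

def pick_subset(image_classes, per_class_cap, seed=0):
    """Pick images so that no class exceeds per_class_cap occurrences.

    image_classes maps image path -> set of class ids present in that image.
    """
    rng = random.Random(seed)
    order = sorted(image_classes)
    rng.shuffle(order)                  # random but reproducible order
    taken = defaultdict(int)            # class id -> images picked so far
    picked = []
    for img in order:
        classes = image_classes[img]
        if classes and all(taken[c] < per_class_cap for c in classes):
            picked.append(img)
            for c in classes:
                taken[c] += 1
    return picked
```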

rubbish_600.data

classes = 44
train = /home/ma-user/work/trainval/600/train.txt
valid = /home/ma-user/work/trainval/600/test.txt
names = /home/ma-user/work/trainval/VOC2007/JPEGImages/classes.names
backup = backup

Thank you!

AlexeyAB commented 4 years ago

Can you share exactly your 600 images and yolo-labels for the purity of the experiment?

netqyq commented 4 years ago

OK, you can download from this 600 imgs or from 600 imgs; they are the same data.

AlexeyAB commented 4 years ago

It works well. It seems something is wrong with your libraries (CUDA, cuDNN, OpenCV) or with your Darknet installation.

darknet.exe detector train data\600-imgs/img.data data\600-imgs\yolov4-rubbish.cfg yolov4.conv.137 -map

600-imgs.zip

In cfg: batch=64, subdivisions=32

Content of img.data file:

classes= 44
train  = data\600-imgs/train600.txt
valid  = data\600-imgs/test600.txt
names = data\600-imgs/train_classes.txt
backup = backup/
eval=coco
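As an aside, .data files are plain key = value pairs, so they are easy to sanity-check programmatically. A throwaway parser (my sketch; the sample uses forward slashes in the paths):

```python
def parse_data_file(text):
    """Parse darknet's 'key = value' .data format into a dict."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip()
    return cfg

cfg = parse_data_file("""\
classes= 44
train  = data/600-imgs/train600.txt
valid  = data/600-imgs/test600.txt
names = data/600-imgs/train_classes.txt
backup = backup/
eval=coco
""")
print(int(cfg["classes"]), cfg["eval"])  # 44 coco
```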

image

chart

netqyq commented 4 years ago

Thank you very much! I have been using a server with one GPU; I will check it again on other hardware platforms.