If you have an issue with training - no-detections / Nan avg-loss / low accuracy:
- what command do you use?
./darknet detector train data/rubbish_data.data cfg/yolov4-rubbish.cfg /home/ma-user/work/yolov4.conv.137
- what dataset do you use?
My customized dataset which has 44 classes.
- what Loss and mAP did you get? avg loss: nan
- show chart.png with Loss and mAP
- check your dataset - run training with flag -show_imgs, i.e. ./darknet detector train ... -show_imgs, and look at the aug_...jpg images, do you see correct truth bounded boxes? No boxes there.
- rename your cfg-file to txt-file and drag-n-drop (attach) to your message here
- show content of generated files bad.list and bad_label.list if they exist. They are empty files.
- Read How to train (to detect your custom objects) and How to improve object detection in the Readme: https://github.com/AlexeyAB/darknet/blob/master/README.md
- show such screenshot with info:
./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/dog.jpg
CUDA-version: 9000 (10020), cuDNN: 7.4.1, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.9.1
0 : compute_capability = 600, cudnn_half = 0, GPU: Tesla P100-PCIE-16GB
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
layer   filters  size/strd(dil)      input                output
   0 conv     32       3 x 3/ 1    608 x 608 x   3 ->  608 x 608 x  32 0.639 BF
130: 466.184570, 642.938660 avg loss, 0.000000 rate, 1.968303 seconds, 8320 images, 19.746453 hours left
131: 773.097473, 655.954529 avg loss, 0.000000 rate, 3.330336 seconds, 8384 images, 19.657633 hours left
132: 766.986023, 667.057678 avg loss, 0.000000 rate, 3.306173 seconds, 8448 images, 19.644869 hours left
133: 768.746643, 677.226562 avg loss, 0.000000 rate, 3.381084 seconds, 8512 images, 19.630889 hours left
134: 740.305725, 683.534485 avg loss, 0.000000 rate, 3.281588 seconds, 8576 images, 19.621175 hours left
135: 733.656799, 688.546692 avg loss, 0.000000 rate, 3.369697 seconds, 8640 images, 19.606058 hours left
136: 734.087097, 693.100708 avg loss, 0.000000 rate, 3.336241 seconds, 8704 images, 19.595945 hours left
137: 714.507507, 695.241394 avg loss, 0.000000 rate, 3.294343 seconds, 8768 images, 19.584077 hours left
138: 700.614136, 695.778687 avg loss, 0.000000 rate, 3.353261 seconds, 8832 images, 19.570008 hours left
139: 694.423218, 695.643127 avg loss, 0.000000 rate, 3.343041 seconds, 8896 images, 19.559320 hours left
140: 652.778015, 691.356628 avg loss, 0.000000 rate, 3.272530 seconds, 8960 images, 19.548166 hours left
141: 659.008179, 688.121765 avg loss, 0.000000 rate, 46.281411 seconds, 9024 images, 19.533224 hours left
142: -nan, -nan avg loss, 0.000000 rate, 46.714420 seconds, 9088 images, 21.909830 hours left
143: -nan, -nan avg loss, 0.000000 rate, 44.583545 seconds, 9152 images, 24.267557 hours left
144: -nan, -nan avg loss, 0.000000 rate, 48.660171 seconds, 9216 images, 26.484042 hours left
145: -nan, -nan avg loss, 0.000000 rate, 44.227352 seconds, 9280 images, 28.903087 hours left
146: -nan, -nan avg loss, 0.000000 rate, 43.479576 seconds, 9344 images, 31.053324 hours left
- check your dataset - run training with flag -show_imgs, i.e. ./darknet detector train ... -show_imgs, and look at the aug_...jpg images, do you see correct truth bounded boxes? No boxes there.
Your dataset is wrong.
Show content of several txt-label files.
Read: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
You should label each object on images from your dataset. Use this visual GUI-software for marking bounded boxes of objects and generating annotation files for Yolo v2 & v3: https://github.com/AlexeyAB/Yolo_mark
It will create .txt-file for each .jpg-image-file - in the same directory and with the same name, but with .txt-extension, and put to file: object number and object coordinates on this image, for each object in new line:
<object-class> <x_center> <y_center> <width> <height>
Where:
- <object-class> - integer object number from 0 to (classes-1)
- <x_center> <y_center> <width> <height> - float values relative to width and height of image, it can be equal from (0.0 to 1.0]
- for example: <x_center> = <absolute_x> / <image_width> or <height> = <absolute_height> / <image_height>
- attention: <x_center> <y_center> - are center of rectangle (are not top-left corner)
For example for img1.jpg you will be created img1.txt containing:
1 0.716797 0.395833 0.216406 0.147222
0 0.687109 0.379167 0.255469 0.158333
1 0.420312 0.395833 0.140625 0.166667
My annotation txt files were converted from VOC XML format; this is my convert script.
The dataset has 44 classes in total and around 15,000 images, and each image can have multiple labels, meaning an image may contain several objects.
After the conversion, I spot-checked the results visually with the LabelImg tool.
Converting from VOC XML (the two Chinese object names below are dataset class labels: 金属厨具 "metal kitchenware" and 砧板 "cutting board")
<annotation>
<folder>1label</folder>
<filename>2b09f04c1b078b57980c0ac9cc18c6b.jpg</filename>
<path>C:\Users\hwx594248\Desktop\1label\2b09f04c1b078b57980c0ac9cc18c6b.jpg</path>
<source>
<database>Unknown</database>
</source>
<size>
<width>1080</width>
<height>1440</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>金属厨具</name>
<pose>Unspecified</pose>
<truncated>1</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>302</xmin>
<ymin>1</ymin>
<xmax>877</xmax>
<ymax>1440</ymax>
</bndbox>
</object>
<object>
<name>砧板</name>
<pose>Unspecified</pose>
<truncated>1</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>1</xmin>
<ymin>1</ymin>
<xmax>1025</xmax>
<ymax>1440</ymax>
</bndbox>
</object>
</annotation>
to txt for YOLO
35 0.5458333333333333 0.5003472222222223 0.5324074074074074 0.9993055555555556
24 0.475 0.5003472222222223 0.9481481481481482 0.9993055555555556
another txt file
1 0.5005208333333333 0.5684895833333333 0.9989583333333333 0.8630208333333333
18 0.659375 0.140625 0.6354166666666666 0.20833333333333334
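The convert script itself is only referenced, not quoted, in this thread. As a point of reference, below is a minimal sketch of the coordinate math such a VOC-to-YOLO converter typically performs; the function name, the class-list argument, and the way labels map to files are illustrative assumptions, not the user's actual code.

# Illustrative sketch of VOC XML -> YOLO txt coordinate conversion (not the user's script).
import xml.etree.ElementTree as ET

def voc_to_yolo_lines(xml_path, class_names):
    """Yield '<class> <x_center> <y_center> <width> <height>' lines, normalized to [0, 1]."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    for obj in root.findall("object"):
        cls_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO expects the box center and size, relative to the image size
        x_c = (xmin + xmax) / 2.0 / img_w
        y_c = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        yield f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

As a sanity check, plugging the first <object> of the XML above into this math reproduces the first txt line: x_center = (302 + 877) / 2 / 1080 ≈ 0.5458, width = (877 - 302) / 1080 ≈ 0.5324.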
Do you need a classifier or a detector? Since there is just one object in the images.
I need object detection on images and to draw bounding boxes.
Show screenshot of cloud of points:
./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 -show
It shows that Darknet is compiled without GPU support.
Download the latest Darknet version.
Try to use burn_in=5000 in the cfg-file.
Why without GPU? Because that image was captured on my MacBook; the parameters are the same as on my training server. My training process runs on a server, which cannot execute the display option (-show).
This is the capture from my training server:
I will try it: burn_in=5000
I am using the latest repo which was downloaded around two days ago.
By the way, I have access to another server with 24 cores and 700+ GB of memory; how can I utilize it better?
I restarted the training process from the checkpoint saved at the last 100-iteration mark, with burn_in=5000, and again got NaN:
174: 933.341797, 780.749146 avg loss, 0.000000 rate, 3.172139 seconds, 11136 images, 17.991696 hours left
175: 971.599854, 799.834229 avg loss, 0.000000 rate, 3.260993 seconds, 11200 images, 17.986480 hours left
176: 941.787659, 814.029541 avg loss, 0.000000 rate, 3.154222 seconds, 11264 images, 17.986201 hours left
177: 937.036682, 826.330261 avg loss, 0.000000 rate, 3.189103 seconds, 11328 images, 17.980036 hours left
178: 965.046997, 840.201904 avg loss, 0.000000 rate, 3.262084 seconds, 11392 images, 17.975844 hours left
179: 946.607117, 850.842407 avg loss, 0.000000 rate, 3.230607 seconds, 11456 images, 17.975704 hours left
180: 960.578308, 861.815979 avg loss, 0.000000 rate, 3.254701 seconds, 11520 images, 17.973824 hours left
181: 590.941345, 834.728516 avg loss, 0.000000 rate, 1.937414 seconds, 11584 images, 17.973280 hours left
182: -nan, -nan avg loss, 0.000000 rate, 2.011170 seconds, 11648 images, 17.923213 hours left
183: -nan, -nan avg loss, 0.000000 rate, 2.055895 seconds, 11712 images, 17.854699 hours left
184: -nan, -nan avg loss, 0.000000 rate, 2.025939 seconds, 11776 images, 17.789327 hours left
185: -nan, -nan avg loss, 0.000000 rate, 2.049610 seconds, 11840 images, 17.722955 hours left
186: -nan, -nan avg loss, 0.000000 rate, 2.035549 seconds, 11904 images, 17.658543 hours left
187: -nan, -nan avg loss, 0.000000 rate, 2.026518 seconds, 11968 images, 17.593997 hours left
When I start from scratch with burn_in=5000, I get -nan at the 32nd iteration.
27: 1094.624878, 1398.332397 avg loss, 0.000000 rate, 1.929200 seconds, 1728 images, 20.005706 hours left
28: 1091.277344, 1367.626953 avg loss, 0.000000 rate, 1.947525 seconds, 1792 images, 19.912686 hours left
29: 1096.390747, 1340.503296 avg loss, 0.000000 rate, 1.932057 seconds, 1856 images, 19.821607 hours left
30: 1083.744751, 1314.827393 avg loss, 0.000000 rate, 1.946099 seconds, 1920 images, 19.730575 hours left
31: 935.924011, 1276.937012 avg loss, 0.000000 rate, 1.675474 seconds, 1984 images, 19.641228 hours left
32: -nan, -nan avg loss, 0.000000 rate, 1.728330 seconds, 2048 images, 19.554481 hours left
33: -nan, -nan avg loss, 0.000000 rate, 1.661758 seconds, 2112 images, 19.454805 hours left
cfg
[net]
# Training
batch=64
subdivisions=16
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.0001
burn_in=5000
max_batches = 20000
policy=steps
steps=16000,18000
scales=.1,.1
...
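A side note on the 0.000000 rate values in the training logs above: with burn_in set, Darknet warms the learning rate up gradually over the first burn_in iterations, so early rates round to zero in the log. A rough sketch of that warm-up, assuming the commonly cited rate = learning_rate * (iteration / burn_in)^power scaling with power defaulting to 4:

# Rough sketch of the burn-in learning-rate warm-up (assumed formula, default power = 4).
def current_rate(iteration, learning_rate=0.0001, burn_in=5000, power=4):
    if iteration < burn_in:
        return learning_rate * (iteration / burn_in) ** power
    return learning_rate  # afterwards, policy=steps multiplies by 'scales' at each of 'steps'

print(f"{current_rate(146):.6f}")  # prints 0.000000, like the early iterations in the logs above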
Can you reproduce the Nan issue with a small dataset, ~100 - 1000 training images, and share these images? I will try to train.
Also you can try to change max_delta=5 to max_delta=1 for each [yolo] layer in the cfg-file.
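For clarity, that change goes into every [yolo] section of yolov4-rubbish.cfg; the excerpt below only illustrates where the line lives (the surrounding keys are typical yolov4 values, not necessarily this cfg's):

[yolo]
mask = 0,1,2
classes=44
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
...
# changed from max_delta=5, as suggested above
max_delta=1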
Again, NaN appeared. Today I picked 600 images from the dataset, with 120 test images and 480 train images. NaN appears at the 484th iteration:
472: 12.685884, 18.907265 avg loss, 0.000005 rate, 3.444814 seconds, 30208 images, 14.311615 hours left
473: 14.853024, 18.501841 avg loss, 0.000005 rate, 3.451708 seconds, 30272 images, 14.355366 hours left
474: 27.073414, 19.358997 avg loss, 0.000005 rate, 3.554246 seconds, 30336 images, 14.399043 hours left
475: 16.928411, 19.115938 avg loss, 0.000005 rate, 3.516808 seconds, 30400 images, 14.447836 hours left
476: 15.022050, 18.706549 avg loss, 0.000005 rate, 3.484360 seconds, 30464 images, 14.494101 hours left
477: 19.145966, 18.750490 avg loss, 0.000005 rate, 3.516784 seconds, 30528 images, 14.538133 hours left
478: 9.029431, 17.778385 avg loss, 0.000005 rate, 3.463961 seconds, 30592 images, 14.583474 hours left
479: 17.181927, 17.718739 avg loss, 0.000005 rate, 3.523369 seconds, 30656 images, 14.625488 hours left
480: 9.644224, 16.911287 avg loss, 0.000005 rate, 3.399598 seconds, 30720 images, 14.670293 hours left
481: 16.324453, 16.852604 avg loss, 0.000005 rate, 56.586223 seconds, 30784 images, 14.707929 hours left
482: -nan, -nan avg loss, 0.000005 rate, 48.838414 seconds, 30848 images, 17.643123 hours left
483: -nan, -nan avg loss, 0.000005 rate, 64.476111 seconds, 30912 images, 20.114554 hours left
484: -nan, -nan avg loss, 0.000005 rate, 45.124677 seconds, 30976 images, 23.408916 hours left
This is the whole rubbish dataset (VOC format); you can use this subset.py to pick any number of images you want from the whole dataset. It generates the train.txt and test.txt for you, and you can set a maximum number of images to pick per class so that the subset is drawn evenly from each class.
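subset.py itself is attached rather than quoted here. Below is a minimal sketch of the per-class sampling described above; the directory layout, the .jpg extension, the 80/20 split, and keying each image by its first label are all assumptions for illustration, not the attached script.

# Illustrative sketch of a per-class subset picker (not the attached subset.py).
import os, random
from collections import defaultdict

def build_subset(image_dir, label_dir, per_class_max, train_txt, test_txt, test_ratio=0.2):
    per_class = defaultdict(list)
    for label_name in os.listdir(label_dir):
        if not label_name.endswith(".txt"):
            continue
        with open(os.path.join(label_dir, label_name)) as f:
            first = f.readline().split()
        if not first:
            continue  # skip images without labels
        image_path = os.path.join(image_dir, label_name.replace(".txt", ".jpg"))
        per_class[int(first[0])].append(image_path)  # keyed by the first object's class

    picked = []
    for images in per_class.values():
        random.shuffle(images)
        picked.extend(images[:per_class_max])  # cap the images taken from each class

    random.shuffle(picked)
    n_test = int(len(picked) * test_ratio)
    with open(test_txt, "w") as f:
        f.write("\n".join(picked[:n_test]) + "\n")
    with open(train_txt, "w") as f:
        f.write("\n".join(picked[n_test:]) + "\n")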
rubbish_600.data
classes = 44
train = /home/ma-user/work/trainval/600/train.txt
valid = /home/ma-user/work/trainval/600/test.txt
names = /home/ma-user/work/trainval/VOC2007/JPEGImages/classes.names
backup = backup
Thank you!
Can you share exactly your 600 images and yolo-labels for the purity of the experiment?
It works well. It seems something is wrong with your libraries (CUDA, cuDNN, OpenCV) or with your Darknet installation.
darknet.exe detector train data\600-imgs/img.data data\600-imgs\yolov4-rubbish.cfg yolov4.conv.137 -map
in cfg: batch=64 subdivisions=32
Content of img.data
file:
classes= 44
train = data\600-imgs/train600.txt
valid = data\600-imgs/test600.txt
names = data\600-imgs/train_classes.txt
backup = backup/
eval=coco
Thank you very much! I have been using a server with one GPU; I will check it again on other hardware platforms.