Training issue - Gaussian model will appear Loss=NAN

tingyangsh commented 4 years ago

  Hi, thank you for your contribution. I trained my dataset with https://github.com/jwchoi384/Gaussian_YOLOv3 and it trains normally, but when I train the Gaussian model with your darknet repo, NaN appears.

Part of the code in Gaussian-test.cfg:

    [convolutional]
    size=1
    stride=1
    pad=1
    filters=42
    activation=linear

    [Gaussian_yolo]
    mask = 0,1,2
    anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319
    classes=5
    num=6
    jitter=.3
    ignore_thresh = .7
    truth_thresh = 1
    iou_thresh=0.213
    uc_normalizer=0.01
    iou_normalizer=0.01
    cls_normalizer=1.0
    #iou_loss=ciou
    scale_x_y = 1.2
    random=1
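As a quick sanity check on the fragment above: for a [Gaussian_yolo] head, the preceding convolutional layer is normally sized as (classes + 9) multiplied by the number of masks, where 9 covers four box means, four box variances and one objectness score. The helper name below is hypothetical and the snippet is only a sketch of that arithmetic, not part of darknet.

```python
def gaussian_yolo_filters(classes: int, masks: int) -> int:
    """Expected `filters=` for the conv layer feeding a [Gaussian_yolo] head.

    Per anchor: 4 box means + 4 box variances + 1 objectness + class scores.
    (Hypothetical helper, written for illustration only.)
    """
    return (classes + 9) * masks


# Values taken from the cfg fragment: classes=5, mask = 0,1,2 (three anchors).
print(gaussian_yolo_filters(classes=5, masks=3))
```

With classes=5 and three masks this gives 42, which matches filters=42 in the posted cfg, so the layer size itself does not look like the cause of the NaN.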

My training command is:

    ./darknet detector train cfg/voc_hrsc5.data cfg/Gaussian-test.cfg

Some error messages are as follows:

    Region 23 Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = -nan, iou_loss = -nan, uc_loss = -nan, total_loss = -nan
    Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0, class_loss = 0.00, iou_loss = 0.00, uc_loss = 0.00 , total_loss = 0.00
    Region 23 Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = -nan, iou_loss = -nan, uc_loss = -nan, total_loss = -nan

    Tensor Cores are disabled until the first 3000 iterations are reached.

    72: -nan, -nan avg loss, 0.000000 rate, 0.353376 seconds, 4608 images, 105.484449 hours left
    Loaded: 0.000025 seconds

    Warning: in txt-labels class_id=240567840 >= classes=5 in cfg-file. In txt-labels class_id should be [from 0 to 4]
    truth.x = 0.000000, truth.y = 0.000003, truth.w = 0.000001, truth.h = 274706242392578588672.000000, class_id = 240567840

    Warning: in txt-labels class_id=193 >= classes=5 in cfg-file. In txt-labels class_id should be [from 0 to 4]
    truth.x = 0.000951, truth.y = 148378222592.000000, truth.w = 18492160923342970278921038200832.000000, truth.h = 15513.363281, class_id = 193
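The warnings above report class ids and box values far outside the valid range. One way to rule out the label files themselves is to scan them offline. The sketch below assumes the usual darknet label layout (one `class_id x y w h` line per object, coordinates normalized to [0, 1]); the function names and the directory argument are hypothetical, and a clean report would only show that the files on disk are fine, not where the bad values come from.

```python
import math
from pathlib import Path


def check_label_line(line: str, num_classes: int) -> list[str]:
    """Return a list of problems found in one 'class_id x y w h' label line."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class_id {class_id} not in [0, {num_classes - 1}]")
    for name, value in zip(("x", "y", "w", "h"), coords):
        # Normalized coordinates must be finite and lie in [0, 1].
        if not math.isfinite(value) or not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} outside [0, 1]")
    return problems


def check_label_dir(label_dir: str, num_classes: int) -> dict[str, list[str]]:
    """Check every .txt file in label_dir; map file path -> list of problems."""
    report: dict[str, list[str]] = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if not line.strip():
                continue  # blank lines are skipped, not reported
            for problem in check_label_line(line, num_classes):
                report.setdefault(str(path), []).append(f"line {lineno}: {problem}")
    return report
```

Calling `check_label_dir("path/to/labels", num_classes=5)` (the path is a placeholder) returns an empty dict when every line passes.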

WongKinYiu commented 4 years ago

Did you use the same learning rate and pre-trained weights?

tingyangsh commented 4 years ago

Yes, the same training parameters and data are used. But I did not use pre-trained weights in either case.

WongKinYiu commented 4 years ago

The original Gaussian YOLO uses a very low learning rate: https://github.com/jwchoi384/Gaussian_YOLOv3/blob/master/cfg/Gaussian_yolov3_BDD.cfg#L18

Also, I have not managed to train Gaussian YOLO successfully without pre-trained weights (it usually hits NaN within 200 steps).
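To confirm whether two training setups really share the same learning rate, the [net] sections of both cfg files can be compared directly. The following is a minimal sketch: the function names and the list of compared keys are my own choices, and the parser only handles the simple `key=value` layout seen in darknet cfg files.

```python
def read_net_section(cfg_text: str) -> dict[str, str]:
    """Collect key=value options from the [net] section of a darknet cfg."""
    options: dict[str, str] = {}
    in_net = False
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("["):
            if in_net:
                break  # the [net] section has ended
            in_net = line.lower() == "[net]"
            continue
        if in_net and "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def diff_net_options(
    cfg_a: str,
    cfg_b: str,
    keys=("learning_rate", "burn_in", "max_batches", "steps", "scales",
          "batch", "subdivisions"),
) -> dict[str, tuple]:
    """Return {key: (value_in_a, value_in_b)} for the keys that differ."""
    net_a, net_b = read_net_section(cfg_a), read_net_section(cfg_b)
    return {k: (net_a.get(k), net_b.get(k))
            for k in keys if net_a.get(k) != net_b.get(k)}
```

Usage would be along the lines of `diff_net_options(open("a.cfg").read(), open("b.cfg").read())`, with both file names being placeholders; an empty result means the listed hyperparameters match.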

colinlin1982 commented 4 years ago

Gaussian YOLO may run into NaN if the training set contains empty annotation files. https://github.com/AlexeyAB/darknet/issues/4455#issuecomment-564333775
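A simple way to test this hypothesis is to list the annotation files that contain no boxes and retrain without the corresponding images. This sketch assumes all label files sit in one directory as `.txt` files; the function name is hypothetical. Note that empty label files are otherwise a legitimate way to mark negative samples in darknet, so this is a diagnostic step rather than a cleanup rule.

```python
from pathlib import Path


def find_empty_label_files(label_dir: str) -> list[Path]:
    """Return label files that are empty or contain only whitespace."""
    empty = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        if not path.read_text().strip():
            empty.append(path)
    return empty


if __name__ == "__main__":
    import sys

    # Usage: python find_empty_labels.py <label_dir>
    for path in find_empty_label_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```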

lq0104 commented 4 years ago

@tingyangsh Hello, I ran into the same problem when training a model with the Gaussian YOLO layer on the newest darknet repo. I get a similar message: "Warning: in txt-labels class_id=193 >= classes=5 in cfg-file. In txt-labels class_id should be [from 0 to 4] truth.x = 0.000951, truth.y = 148378222592.000000, truth.w = 18492160923342970278921038200832.000000, truth.h = 15513.363281, class_id = 193". But if I train the same model on the same computer with the copy of the repo I downloaded on 2020.02.20, everything is OK. Is it a bug or something else? @WongKinYiu @AlexeyAB My GPU is a GeForce RTX 2080 Ti, the CUDA version is 10.1.243, and the cuDNN version is 7.6.2. Thanks.

lq0104 commented 4 years ago

Update: Today I ran some experiments on this training problem. I downloaded https://github.com/AlexeyAB/darknet/releases/tag/darknet_yolo_v4_pre (release 2020.05.15) and trained the model with the Gaussian_Yolo layer. Everything is OK: there is no NaN output and the mAP value is normal. I also compared "gaussian_yolo_layer.c" between the 2020.05.15 release and the newest repo, and there is little difference between them. I then replaced the new "gaussian_yolo_layer.c" in the newest repo with the 2020.05.15 version, ran make clean and make, and trained again, but the problem still occurs. So I infer that the change causing this problem was introduced after 2020.05.15 and that it is not in "gaussian_yolo_layer.c". Does training a model with the Gaussian_Yolo layer work normally for you on the newest repo? Thank you very much! @WongKinYiu @AlexeyAB

wenchao1993 commented 3 years ago

Hi lq0104, I ran into the same problem as you. Could you share the copy of the repo you downloaded on 2020.02.20? I am struggling to make Gaussian YOLO work. My e-mail is zhang_wenchao1@163.com. Thank you very much!

wenchao1993 commented 3 years ago

@lq0104 When I use the "darknet_yolo_v4_pre" release you shared to train gaussian_yolo, it does not use the GPU. Did you see the same thing? Thank you.

AlexeyAB / darknet

Training issue - Gaussian model will appear Loss=NAN #6314