I trained YOLOv4 on the COCO 2017 dataset on Google Cloud Platform, and the loss value changed to -nan.

Philharmy-Wang commented 3 years ago

I used YOLOv4 to train on the COCO 2017 dataset on Google Cloud Platform. After about 400 steps, the loss value was about 31. After about 900 steps, the loss began to rise, and it finally became -nan.
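To pin down the exact iteration at which the loss stops being a finite number, the saved training output can be scanned. The following is only a minimal sketch: it assumes each darknet progress line has the form `<iteration>: <loss>, <avg loss> avg loss, <rate> rate, ...`, and the function names are hypothetical helpers, not part of darknet.

```python
import math
import re

# Assumed progress-line shape (an assumption, not confirmed in this thread):
#   "<iteration>: <loss>, <avg loss> avg loss, <rate> rate, ..."
LINE_RE = re.compile(r"^\s*(\d+):\s*(\S+),\s*(\S+) avg loss")


def parse_losses(lines):
    """Yield (iteration, loss, avg_loss) for every line that matches LINE_RE."""
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        try:
            # float() accepts "nan", "-nan", "inf" and "-inf" as well as numbers.
            yield int(match.group(1)), float(match.group(2)), float(match.group(3))
        except ValueError:
            continue  # Skip lines whose fields are not numeric.


def first_non_finite(lines):
    """Return the first iteration whose loss or avg loss is NaN/inf, else None."""
    for iteration, loss, avg_loss in parse_losses(lines):
        if not (math.isfinite(loss) and math.isfinite(avg_loss)):
            return iteration
    return None
```

With the output redirected to a file (for example by appending `| tee train.log` to the training command), `first_non_finite(open("train.log"))` would report where the divergence starts, if the log matches the assumed format.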

The GPU I use is a Tesla V100. Training environment: Ubuntu 18.04, cuDNN 7.6.5, CUDA 10.2, OpenCV 3.4.4. The training command I use is ./darknet detector train cfg/coco.data cfg/yolov4.cfg -map . I tried both the coco2017 and coco2014 datasets for training; the loss is -nan and I never get an mAP value. This is the Makefile I used:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0
ZED_CAMERA_v2_8=0

USE_CPP=0
DEBUG=0
ARCH= -gencode arch=compute_70,code=[sm_70,compute_70]

OS := $(shell uname)

This is the cfg file I used:

[net]
batch=64
subdivisions=8
# Training
#width=512
#height=512
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.0013
burn_in=1000
max_batches = 200000
policy=steps
steps=160000,180000
scales=.1,.1

...

[yolo]
mask = 6,7,8
anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=0
scale_x_y = 1.05
iou_thresh=0.213
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
nms_kind=greedynms
beta_nms=0.6
max_delta=5
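For reference, here is a rough sketch of how the learning rate evolves under the [net] settings above. It assumes darknet's usual behaviour: during burn-in the rate is learning_rate * (iteration / burn_in) ^ power with power defaulting to 4 when the cfg does not set it, and after burn-in the steps policy multiplies the rate by each scale once its step is reached. The function name `current_rate` is a hypothetical helper for illustration.

```python
def current_rate(iteration, learning_rate=0.0013, burn_in=1000, power=4.0,
                 steps=(160000, 180000), scales=(0.1, 0.1)):
    """Approximate learning rate at a given iteration for policy=steps.

    Defaults mirror the cfg in this post; power=4 is an assumed default
    because the cfg does not set it.
    """
    if iteration < burn_in:
        # Warm-up: the rate ramps from 0 up to learning_rate.
        return learning_rate * (iteration / burn_in) ** power
    rate = learning_rate
    for step, scale in zip(steps, scales):
        if iteration >= step:
            rate *= scale  # Apply each scale once its step has been reached.
    return rate


if __name__ == "__main__":
    for it in (400, 900, 1000, 160000, 180000):
        print(it, current_rate(it))
```

Under these assumptions the rate is about 3.3e-05 at step 400 and about 8.5e-04 at step 900, so the loss starts rising while the rate is approaching its full value of 0.0013 near the end of burn-in.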

This is the chart.png:

[image: chart.png]

Philharmy-Wang commented 3 years ago

I used ./darknet/scripts/get_coco2017.sh and ./darknet/scripts/get_coco_dataset.sh to get the COCO datasets.

Diaislam commented 3 years ago

It seems that a recent commit introduced this bug. I faced the same problem and solved it by checking out an older commit. Just check out commit 8c9c5171891ea92b0cbf5c7fddf935df0b854540 and it will work.

Philharmy-Wang commented 3 years ago

> It seems that a recent commit introduced this bug. I faced the same problem and solved it by checking out an older commit. Just check out commit 8c9c517 and it will work.

OK! I will try it ~.~ Thank you very much!!

parneetk commented 3 years ago

@Philharmy-Wang I am facing a similar issue. Did switching to the older commit resolve it?

AlexeyAB / darknet

I trained YOLOv4 on the COCO 2017 dataset on Google Cloud Platform, and the loss value changed to -nan. #7093