GOATmessi8 / ASFF

yolov3 with mobilenet v2 and ASFF
GNU General Public License v3.0
1.05k stars, 216 forks

Loss is NaN while training both baseline and ASFF, batch size 16 on 4 V100 #102

Closed. kingthreestones closed this issue 1 year ago

kingthreestones commented 2 years ago

Hello, I am having trouble with training: the loss becomes NaN. I am training both the baseline and ASFF on 4 V100 GPUs with a batch size of 16, as in your paper. Here is my command:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=10266 main.py --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 4 --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608

the cfg: image

the tensorboard: image

the log: image

Please help me! Thank you!
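
For anyone hitting the same symptom, it can help to find the first step at which the loss stops being finite. The command above passes --half, and half-precision values can overflow to inf before a NaN appears, so that is one possibility worth checking; this is an assumption, not a diagnosis of this run. The sketch below is plain Python and not part of the ASFF code; the function name `first_non_finite` and the sample loss values are hypothetical.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or inf loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical loss values for illustration only; not taken from the log in this issue.
example_losses = [12.4, 9.8, 7.1, float("inf"), float("nan")]
print(first_non_finite(example_losses))
```

If the first non-finite value is inf rather than NaN, that would point toward overflow rather than an invalid operation, which narrows where to look next.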