Loss is NaN while training both baseline and ASFF, batch size 16 on 4 V100

Hello, I am having trouble with training: the loss turns to NaN. I am training both the baseline and ASFF on 4 V100 GPUs with a batch size of 16, as in your paper. Here is my command:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=10266 main.py --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 4 --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608

the cfg:

the tensorboard:

the log:

Please help me! Thank you!

GOATmessi8 / ASFF, issue #102