hjjlovecyy opened 5 years ago
I find that if I change total_loss.backward()
in train.py to another loss, such as total_reg_loss.backward(),
the "an illegal memory access was encountered" error no longer happens,
and the result no longer becomes NaN as described in my question.
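In case it helps anyone debugging this: a minimal sketch (not the repo's actual code) of checking each loss term before calling backward(), so the first non-finite term can be identified directly instead of only switching which loss backward() is called on. The names total_loss, cls_loss, and reg_loss match the training log later in this thread; backward_if_finite is a hypothetical helper.

```python
# Minimal sketch, assuming loss terms named total_loss, cls_loss, reg_loss.
import torch

def backward_if_finite(total_loss, cls_loss, reg_loss, step):
    """Hypothetical helper: report non-finite loss terms and skip bad updates."""
    for name, value in (("total_loss", total_loss),
                        ("cls_loss", cls_loss),
                        ("reg_loss", reg_loss)):
        if not torch.isfinite(value).all():
            print(f"step {step}: {name} is not finite ({value})")
    if torch.isfinite(total_loss).all():
        total_loss.backward()
        return True
    return False  # caller can skip optimizer.step() for this batch
```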
NaN still appears, and the model does not converge. This also seems related to the batch_size.
Maybe you can check your pretrained model.
Hello @hjjlovecyy, I also met this problem (NaN appears in the loss). Have you solved it? Thanks a lot!
@hjjlovecyy How did you solve the NaN problem? I met this problem too.
@hjjlovecyy How did you solve the NaN problem? I met it too. Thanks a lot!
I met the same problem. How can it be solved?
Hello, thank you for your great work! After modifying some code, train.py runs successfully, but the loss is very strange, as follows:

Epoch0 Iter0 --- total_loss: nan, cls_loss: nan, reg_loss: 0.6299
Epoch0 Iter1 --- total_loss: nan, cls_loss: nan, reg_loss: 1.7844
Epoch0 Iter2 --- total_loss: nan, cls_loss: nan, reg_loss: 34.9781
Epoch0 Iter3 --- total_loss: nan, cls_loss: nan, reg_loss: 238.0343
Epoch0 Iter4 --- total_loss: nan, cls_loss: nan, reg_loss: 236.5256
Epoch0 Iter5 --- total_loss: nan, cls_loss: nan, reg_loss: 70.5485
Epoch0 Iter6 --- total_loss: nan, cls_loss: nan, reg_loss: 113.4333
Epoch0 Iter7 --- total_loss: nan, cls_loss: nan, reg_loss: 56.8303
100%|██████████| 8/8 [00:10<00:00, 1.25s/it] Saving model...
Have you met this before? Thanks.
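Since cls_loss is NaN from the very first iteration while reg_loss explodes into the hundreds, it may help to turn on autograd anomaly detection and clip gradients. Below is a minimal, self-contained sketch of that pattern; the synthetic data, tiny nn.Linear model, and MSE/L1 stand-in losses are assumptions for illustration and do not come from this repo.

```python
# Minimal sketch: anomaly detection traces NaN gradients back to the op that
# produced them; clip_grad_norm_ contains exploding gradients like the large
# reg_loss values in the log above.
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # raise with a traceback on NaN grads

model = nn.Linear(16, 4)                      # stand-in for the real detector
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(8):
    x = torch.randn(32, 16)                   # synthetic batch
    target = torch.randn(32, 4)
    optimizer.zero_grad()
    out = model(x)
    cls_loss = nn.functional.mse_loss(out, target)  # stand-in loss terms
    reg_loss = nn.functional.l1_loss(out, target)
    total_loss = cls_loss + reg_loss
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    print(f"Iter{step} --- total_loss: {total_loss.item():.4f}")
```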