Closed fred206968 closed 3 years ago
No, I haven't run into this problem, but I'm going to need more info:
It also helps if you give me the complete command or script you used.
I use the VOC dataset with the 19-1 task and run plop_19-1.sh. The command line keeps outputting the following warning message: The loss is nan starting from step 1
I'm currently rerunning this script. I don't have many GPUs available right now, so I may have results in a few days. But I've just rerun a quick iteration with one epoch per step, and I didn't see your problem.
This setting (VOC 19-1) has already been reproduced by others (with even better results), so I suspect there is a problem on your side.
What are the versions of torch, torchvision, apex, and cuda?
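For anyone answering the same question, here is a minimal sketch that collects those versions in one go. The helper name `report_versions` is hypothetical and not part of this repository; it only relies on each package exposing `__version__` and on `torch.version.cuda`, which reports the CUDA version torch was built against (it is `None` on CPU-only builds).

```python
import importlib
import importlib.util


def report_versions(packages=("torch", "torchvision", "apex")):
    """Map each package name to its version string, or None if it is not installed."""
    versions = {}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found
        if importlib.util.find_spec(name) is None:
            versions[name] = None
            continue
        try:
            module = importlib.import_module(name)
        except Exception:
            # Installed but broken (e.g. a failed native build)
            versions[name] = "import failed"
            continue
        versions[name] = getattr(module, "__version__", "unknown")

    # CUDA version torch was compiled against, only if torch imported cleanly
    versions["cuda"] = None
    if versions.get("torch") not in (None, "import failed"):
        import torch
        versions["cuda"] = torch.version.cuda
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```

Pasting the printed output into the issue is usually enough to spot a version mismatch.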
Problem Solved. Thanks
What was the problem?
Your solution may help others that encounter the same problem.
The problem was that I had installed apex without the C++ extension.
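A quick way to check for this situation is to test whether apex's compiled modules can be found. The sketch below assumes the extension module names `apex_C` (built with the C++ extension option) and `amp_C` (built with the CUDA extension option); these names may differ between apex versions, so treat them as an assumption and check the apex README for your version. The helper `missing_extensions` is hypothetical.

```python
import importlib.util


def missing_extensions(names=("apex_C", "amp_C")):
    """Return the module names from `names` that cannot be found.

    The default names are an assumption about how apex labels its
    compiled extensions; adjust them for your apex version.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_extensions()
    if missing:
        print("Missing compiled extensions:", ", ".join(missing))
        print("apex may have been installed as a Python-only build.")
    else:
        print("Compiled apex extensions found.")
```

If any module is reported missing, rebuilding apex with the extensions enabled, as described in its README, is the likely fix.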
Good to know, thanks!
While reproducing your results with the settings given in the scripts folder, I found that my results are a bit lower than those in the paper (by 2%-5%). When I checked the loss in TensorBoard, I found that from step 1 onward the loss does not converge and some values are NaN.
Have you ever run into this problem?
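To pin down where the NaN values first appear instead of reading them off the curves afterwards, a small guard over the logged loss values can help. This is a generic sketch, not code from this repository; `first_non_finite` is a hypothetical helper that works on plain Python floats (for example, values obtained with `loss.item()`).

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss value, or None if all are finite."""
    for index, value in enumerate(losses):
        if not math.isfinite(value):
            return index
    return None


if __name__ == "__main__":
    logged = [0.91, 0.74, float("nan"), 0.52]
    bad = first_non_finite(logged)
    if bad is not None:
        print(f"Loss became non-finite at iteration {bad}")
```

Stopping training at the first non-finite value makes it easier to inspect the batch and the optimizer state that produced it.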