arthurdouillard / CVPR2021_PLOP

Official code of CVPR 2021's PLOP: Learning without Forgetting for Continual Semantic Segmentation

https://arxiv.org/abs/2011.11390

MIT License

145 stars, 23 forks

Loss does not converge #10

Closed: fred206968 closed 3 years ago

fred206968 commented 3 years ago

While reproducing your results with the settings given in the scripts folder, I found that my results are 2% to 5% lower than those reported in the paper. When I checked the loss in TensorBoard, I found that from step 1 onward the loss does not converge, and some of the values are NaN.

Have you ever run into this problem?
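A quick way to narrow down a report like this is to find where the logged loss first becomes NaN or infinite. The following is a minimal sketch, not part of the PLOP code base; the function name and the example values are hypothetical and only illustrate the check.

```python
import math


def first_non_finite_index(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for index, value in enumerate(losses):
        if not math.isfinite(value):
            return index
    return None


# Hypothetical loss values, for illustration only (not taken from this issue).
example_losses = [2.31, 1.87, float("nan"), 1.52]
print(first_non_finite_index(example_losses))
```

Knowing the first offending iteration makes it easier to tell whether the loss diverges gradually or breaks immediately.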

arthurdouillard commented 3 years ago

No, I don't have this problem, but I'm going to need more info:

What is the dataset? VOC?
What is the setting? 15-5? 15-1?
Did you use mixed precision? It seems (see #8) that without apex's loss scaling you may encounter problems.

It would also help if you gave me the complete command or script used.
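To illustrate why loss scaling matters for the mixed-precision question above, here is a small sketch using only the Python standard library. It does not use apex and does not reproduce the NaN reported in this issue; it only shows one failure mode that loss scaling addresses, namely small gradients underflowing to zero in half precision. The gradient value and the scale factor are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8   # hypothetical gradient, smaller than fp16 can represent
scale = 65536.0    # hypothetical loss scale (2**16)

unscaled = to_fp16(tiny_grad)        # stored directly in fp16: underflows to zero
scaled = to_fp16(tiny_grad * scale)  # scaled first: fits in fp16's normal range
recovered = scaled / scale           # unscaled again in higher precision

print(unscaled)
print(recovered)
```

Multiplying the loss by a constant multiplies every gradient by the same constant, so the gradients can be divided by it again before the optimizer step without changing the update.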

fred206968 commented 3 years ago

I use the VOC dataset with the 19-1 task, and I run plop_19-1.sh. The command line keeps printing a warning message, and the loss is NaN starting from step 1.

arthurdouillard commented 3 years ago

I'm currently rerunning this script. I don't have many GPUs available right now, so I may have results in a few days. But I've just rerun a quick iteration with one epoch per step, and I didn't have your problem.

This setting (VOC 19-1) has already been reproduced by others (with even better results), so I suspect there is a problem on your side.

What are the versions of torch, torchvision, apex, and cuda?
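One way to collect the versions asked for above is a short script like the following. It is a sketch rather than anything shipped with the repository: the function names are hypothetical, and it assumes each package exposes a `__version__` attribute, reporting "unknown" when it does not. `torch.version.cuda` gives the CUDA version PyTorch was built with, which may differ from the system toolkit.

```python
import importlib


def collect_versions(names=("torch", "torchvision", "apex")):
    """Map each package name to its version, "unknown", or "not installed"."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


def cuda_version():
    """Return the CUDA version PyTorch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    report = collect_versions()
    report["cuda (torch build)"] = cuda_version()
    for name, version in report.items():
        print(f"{name}: {version}")
```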

fred206968 commented 3 years ago

Problem solved. Thanks.

arthurdouillard commented 3 years ago

What was the problem?

Your solution may help others that encounter the same problem.

fred206968 commented 3 years ago

The problem was that I had installed apex without the C++ extensions.
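For anyone wanting to check for the same situation, the sketch below tests whether apex's compiled modules can be found in the current environment. The module names `apex_C` and `amp_C` are an assumption about how apex names its compiled extensions when built with its `--cpp_ext` and `--cuda_ext` options; verify them against the apex version you have installed. The helper name is hypothetical.

```python
import importlib.util

# Assumed names of apex's compiled modules; verify against your apex version.
APEX_EXTENSIONS = ("apex_C", "amp_C")


def missing_modules(names):
    """Return the names that cannot be located as top-level importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(("apex",) + APEX_EXTENSIONS)
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("apex and its compiled extensions were found.")
```

The check uses `find_spec` instead of importing the modules, so it only looks them up and does not load any compiled code.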

arthurdouillard commented 3 years ago

Good to know, thanks!