ZJULearning / pixel_link

Implementation of our paper 'PixelLink: Detecting Scene Text via Instance Segmentation' (AAAI 2018)

After many steps the loss becomes NaN (loss = NaN) #164

Closed · famunir closed this 3 years ago

famunir commented 4 years ago

Hello, I have been training PixelLink on my own data. I have tried various settings (e.g. different batch sizes, rotation probabilities, minimum side lengths, etc.), but the loss always ends up as NaN. However, when I train on ICDAR 2015, training goes fine. I am using a single GPU (6 GB) and have adjusted the related parameters; to avoid GPU out-of-memory (OOM) errors, the largest batch size I use is 6. Any ideas about the cause of this issue, or how to solve it, would be appreciated.

Thank you for your help.

shiyihan commented 4 years ago

Hello, I have the same problem. I trained PixelLink on ICDAR 2015 and only changed the data path, but the training loss drops to 0.000 by step 4. When you trained on ICDAR 2015, did you change any other settings? Or did you run into the same problem on ICDAR 2015 as well?

famunir commented 4 years ago

Hi, I ran into both of the problems described in this thread. Here is what I did for each.

1) Training loss starts very low and converges to 0.000 very quickly (the problem you describe).
Solution: assuming there is no problem with the changed data paths, have a look at the ground-truth text files. Each line looks like x1,y1,x2,y2,x3,y3,x4,y4,###, where in ICDAR 2015 the last field is the transcription word instead of ###. If your own data has ### at the end of every line, replace the ### with some random word string; since this pipeline only does detection, not recognition, it does not matter which word you put there. See the sketch below.
Cause: I still don't know the exact cause, but I suspect something goes wrong when the ### entries are converted to tfrecords.
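A minimal sketch of that replacement, assuming ICDAR-2015-style ground-truth files named gt_*.txt with comma-separated lines as above; the directory name and placeholder word are illustrative, not taken from the pixel_link repo:

```python
import glob
import os

GT_DIR = 'train_gts'     # assumed location of the ground-truth txt files
PLACEHOLDER = 'text'     # any non-empty word works for detection-only training

for path in glob.glob(os.path.join(GT_DIR, 'gt_*.txt')):
    fixed_lines = []
    # 'utf-8-sig' strips the BOM that ICDAR ground-truth files often carry.
    with open(path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            parts = line.strip().split(',')
            # x1..y4 take the first 8 fields; the 9th is the transcription.
            if len(parts) >= 9 and parts[8].strip() == '###':
                parts[8] = PLACEHOLDER
            fixed_lines.append(','.join(parts))
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(fixed_lines) + '\n')
```

Run this once over the ground-truth directory before generating the tfrecords, so the conversion never sees a bare ### transcription.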

2) Training loss going to NaN.
Solution: I let training run until the loss diverged to NaN, then reduced the learning rate to 10^-4 or 10^-5 and resumed from the last valid checkpoint saved before the NaN occurred, as sketched below.
Cause: I am still trying to figure this out. It is a strange problem, because the same code trains fine on the ICDAR 2015 dataset but fails on my own dataset. I hope this solves your issue.
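An illustrative TF1-style sketch of that recovery step (the directory, variable names, and learning-rate values are assumptions, not the actual pixel_link training script):

```python
import tensorflow as tf

CHECKPOINT_DIR = 'train_logs'   # assumed directory holding the model.ckpt-* files
NEW_LEARNING_RATE = 1e-4        # drop further to 1e-5 if the loss diverges again

# Rebuild the same graph as the original run; only the learning rate changes.
global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = tf.constant(NEW_LEARNING_RATE, name='learning_rate')
# ... the rest of the model / loss / optimizer is built exactly as before ...

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
    if latest is not None:
        # Restore the variables from the last checkpoint saved before the
        # loss went NaN, so training resumes instead of restarting.
        saver.restore(sess, latest)
    # ... continue the training loop from here ...
```

The key point is simply to restore from the last checkpoint whose loss was still finite and continue with a smaller learning rate, rather than restarting training from scratch.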