amazon-science / siam-mot

SiamMOT: Siamese Multi-Object Tracking

NaN values for loss and accuracy in training and testing. #23

Closed adityagupta-9900 closed 2 years ago

adityagupta-9900 commented 3 years ago

Due to our limited capacity, we reduced training to 4 GPUs and kept the default learning rate of 0.02. After 40-60 iterations we started getting NaN loss values. We then reduced the learning rate to 0.015 and trained again; even with this, beyond 200 iterations it sometimes shows NaN losses and sometimes runs fine. When we tested a model trained with NaN losses, all the accuracy values in the output table came out as NaN.

[Screenshot attached: WhatsApp Image 2021-09-04 at 22 48 10]
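A minimal sketch of one way to catch the first non-finite loss during training instead of letting NaNs propagate; the `loss_dict` argument, the helper name, and the call site are assumptions about a generic training loop, not SiamMOT's actual API:

```python
import math

import torch


def assert_finite_losses(loss_dict, iteration):
    """Raise as soon as any loss becomes NaN/Inf so the offending
    iteration (and its inputs) can be inspected.

    `loss_dict` is assumed to be a dict of per-task loss tensors
    produced by the model's forward pass; the name is hypothetical.
    """
    for name, value in loss_dict.items():
        scalar = value.item() if torch.is_tensor(value) else float(value)
        if not math.isfinite(scalar):
            raise FloatingPointError(
                f"loss '{name}' is {scalar} at iteration {iteration}"
            )


# Optionally, make the first bad operation in the backward pass easier to
# locate (this slows training down, so only enable it while debugging):
torch.autograd.set_detect_anomaly(True)
```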

mondrasovic commented 2 years ago

Currently, I am working on a solution to this. So far I have identified that the ground-truth bounding boxes disappear during the filtering phase (code here).

As you can see in the report, everything is labeled as FP (false positive). Why? I still have no idea, but it is one of the things I have to resolve before moving forward.

There is another issue reporting this very same problem.
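To illustrate the failure mode, here is a minimal sketch (this is not SiamMOT's actual filtering code; the field names and threshold are assumptions): if the ground-truth filtering step drops every box, the evaluator has nothing to match against, so every detection gets counted as a false positive.

```python
def filter_gt_boxes(gt_boxes, min_visibility=0.1):
    """Keep only ground-truth boxes whose visibility passes the threshold."""
    return [box for box in gt_boxes if box.get("visibility", 1.0) >= min_visibility]


# Hypothetical annotations where every box falls below the threshold.
gt_boxes = [
    {"id": 1, "bbox": [100, 120, 40, 90], "visibility": 0.0},
    {"id": 2, "bbox": [300, 150, 35, 80], "visibility": 0.05},
]

kept = filter_gt_boxes(gt_boxes)
if not kept:
    print("All GT boxes filtered out -> every prediction becomes a false positive.")
```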

mondrasovic commented 2 years ago

For the solution, see my answer in this related issue on the project.

adityagupta-9900 commented 2 years ago

@mondrasovic Thank you so much for your help. But there is one more thing I need to clarify: I am getting NaN values during training as well. After 40-60 iterations, I start getting NaN (NaN) loss values while training on the MOT dataset. Could you suggest why that would happen?

mondrasovic commented 2 years ago

> @mondrasovic Thank you so much for your help. But there is one more thing I need to clarify: I am getting NaN values during training as well. After 40-60 iterations, I start getting NaN (NaN) loss values while training on the MOT dataset. Could you suggest why that would happen?

I am sorry that I only answered one part of the question. I was stuck on that problem, which narrowed my focus. Nevertheless, here are my suggestions, because I experienced the same issues.

Since I am a mere mortal, I do not have 8 GPUs on my local machine. As a result, I had to reduce the batch size significantly to make training feasible. Once that happens, the learning rate should be adjusted accordingly, too. Indeed, the only time I have experienced what you describe with this architecture was when the learning rate was too high, which produced exploding gradients.

I would bet more than just two cents on this, because you mentioned that you used 4 GPUs and yet kept the default learning rate. Even your description of how the training progressed is highly indicative of this cause.
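As a rough starting point, the linear scaling rule says to scale the learning rate with the effective batch size. This is my rule of thumb rather than anything from the SiamMOT documentation, and the images-per-GPU value below is an assumption; use the one from your config:

```python
# Linear scaling rule: new_lr = reference_lr * (new_batch / reference_batch).
reference_gpus = 8
reference_lr = 0.02        # the default mentioned above, tuned for 8 GPUs

my_gpus = 4                # e.g. the setup described in this issue
images_per_gpu = 2         # hypothetical; depends on your config

scaled_lr = reference_lr * (my_gpus * images_per_gpu) / (reference_gpus * images_per_gpu)
print(f"linearly scaled BASE_LR: {scaled_lr}")  # 0.01 in this example
```

Treat the scaled value only as a starting point; in my case I ended up going lower still.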

Currently, I am using

BASE_LR: 0.002

and it works.

Hope this helps!

adityagupta-9900 commented 2 years ago

@mondrasovic
Thank you so much. It helped a lot.