Closed khaep closed 3 years ago
Hello everyone (@KHeap25 @chenwydj), I have exactly the same problem: the loss also becomes NaN during training in step 2.2.
What could the reason be? Have you been able to identify the problem, @KHeap25?
I hope to find a solution here; we would like to publish soon.
Hey @Gaussianer, hey @chenwydj,
I have run the last training step (step 2.2) again on two identical GPUs, independently of each other. One GPU finished without the NaN error, while the other one (the same as above) produced NaN values again.
So it seems to be a problem with my hardware/platform.
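To tell a hardware problem apart from ordinary run-to-run randomness, it helps to make the two runs comparable in the first place. A minimal sketch, assuming a Python/PyTorch training script (the torch-specific calls are shown as comments and are an assumption, not code from this repo):

```python
import random

def seed_everything(seed: int = 42) -> None:
    """Seed the RNGs so two runs can be compared step for step."""
    random.seed(seed)
    # In a PyTorch training script one would additionally seed torch and
    # request deterministic cuDNN kernels (assumed, not from this repo):
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False

seed_everything(0)
a = [random.random() for _ in range(3)]
seed_everything(0)
b = [random.random() for _ in range(3)]
print(a == b)  # True: re-seeding reproduces the same sequence
```

If both GPUs still diverge under identical seeds and deterministic kernels, a hardware/driver difference becomes much more plausible.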
Furthermore, I can't say what influence labels that are listed in labels.py but not present in the images of the train and val datasets have on the training process. Maybe you can give a hint; that would be interesting for people who are working with custom data.
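One concrete way such "missing" labels can bite is in loss averaging: if every pixel in a batch carries an ignored label, a naive mean divides by zero and produces NaN. A plain-Python sketch (the names `IGNORE_INDEX` and `masked_mean_loss` are illustrative, not from the repo):

```python
IGNORE_INDEX = 255  # hypothetical ignore id, as in Cityscapes-style labels.py

def masked_mean_loss(per_pixel_losses, targets, ignore_index=IGNORE_INDEX):
    """Average per-pixel losses, skipping pixels with the ignore label."""
    kept = [loss for loss, t in zip(per_pixel_losses, targets) if t != ignore_index]
    if not kept:
        return 0.0  # guard: 0/0 would otherwise yield NaN
    return sum(kept) / len(kept)

print(masked_mean_loss([0.5, 1.5], [3, 7]))      # 1.0 (both pixels kept)
print(masked_mean_loss([0.5, 1.5], [255, 255]))  # 0.0 (guard instead of 0/0)
```

PyTorch's cross-entropy loss offers an `ignore_index` argument for the same purpose; whether this training script sets it for the unused labels would be worth checking.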
If anybody faces this error/behavior again, feel free to open a new issue and cite these comments. Perhaps we can then find the origin of the NaN error.
Hey @chenwydj ,
during step 2.2 with custom labels, the value of the loss function becomes NaN in the last few epochs (see image below).
Everything appears to be OK until epoch 341.
![image](https://user-images.githubusercontent.com/66534155/102871938-584cf900-443f-11eb-8c4f-a6cb5bbd4ae5.png)
From epoch 351 on, something is wrong with the mIoU: all labels except "unlabeled" drop to an mIoU of 0%.
![image](https://user-images.githubusercontent.com/66534155/102872037-787cb800-443f-11eb-8c52-7b63a2291957.png)
It seems to be a numerical-stability problem. Do you agree? Do you have an idea how to solve it?
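A minimal sketch of the kind of numerical-stability issue that can turn a cross-entropy loss into NaN: exponentiating large logits overflows, and log(0) gives -inf. The standard log-sum-exp shift avoids both (plain-Python illustration, not the repo's actual loss code):

```python
import math

def stable_log_softmax(logits, i):
    """log-softmax of logits[i], shifted by the max logit so exp() cannot overflow."""
    m = max(logits)
    return logits[i] - m - math.log(sum(math.exp(x - m) for x in logits))

big = [1000.0, 0.0]
# math.exp(1000.0) overflows, so a naive log-softmax would blow up here;
# the shifted version stays finite:
print(stable_log_softmax(big, 0))
```

If the loss implementation already uses a fused log-softmax, the NaN more likely comes from elsewhere (e.g. an exploding learning rate or an empty-label batch), so this is only one hypothesis to check.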
It would be great if you could give me some hints to answer these questions.
Kind regards.