TiagoCortinhal / SalsaNext

Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving

Model loss is NaN #35

Closed sebaleme closed 3 years ago

sebaleme commented 3 years ago

Hi,

I am able to train the SalsaNext model with your training infrastructure and the KITTI dataset. Now we are trying to train SalsaNext on our own data with our own infrastructure (PyTorch Lightning). First we had to adapt the network, because our frame format is different and after 4 poolings we reach odd dimensions that prevent further pooling, so we took the easier path and removed some layers. As a first approach we are also not using the preprocessing. Now the data size matches the layers, but the loss is always NaN. As the data propagates through the layers, more and more NaNs appear in the layer outputs. Of course we have some initial NaNs in our LiDAR frames (all the points that don't get any return), but not more than in the KITTI dataset. We found that if we replace all the LeakyReLU activations with plain nn.ReLU, this does not happen and we get a converging training loss. Do you have any idea?
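To localise where the NaNs first show up, one option is to attach a forward hook to every submodule and report the first layer whose output contains NaNs; a minimal sketch (the helper `register_nan_hooks` is not part of SalsaNext):

```python
import torch
import torch.nn as nn

def register_nan_hooks(model: nn.Module):
    """Attach a forward hook to every submodule that reports NaNs in its output."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"NaN in output of {name} ({module.__class__.__name__})")
        return hook

    handles = []
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each one to detach the hooks later
```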

We still have some further tests to do to get a clearer picture. For instance, we are using the Adam loss function and not the cross-entropy, and we could also change our learning rate, but the fact that LeakyReLU works for the KITTI data and not for ours is intriguing. So we thought that maybe you had similar issues while implementing your model.

TiagoCortinhal commented 3 years ago

Hello @sebaleme!

Could you tell me what the Adam loss function is? I am not aware of this loss.

One common issue that can give rise to NaNs is exploding gradients. This could be due to a high learning rate, a lack of normalization, etc. Can you check the flow of your gradients?
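For example, something like this quick sketch (the helper name is mine, nothing from the repo) prints the per-parameter gradient norms after `loss.backward()`; exploding gradients show up as very large, inf, or NaN values:

```python
import torch

def log_grad_norms(model: torch.nn.Module):
    """Call right after loss.backward(): print each parameter's gradient norm."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            norm = param.grad.detach().norm().item()
            print(f"{name}: grad norm = {norm:.4e}")
```

If the norms do blow up, clipping them with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` before the optimizer step is a common mitigation.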

sebaleme commented 3 years ago

Hi, sorry, I meant that the optimizer is different. As you noticed, Adam is not the loss function itself but the optimizer of the loss. So we are both using a kind of cross-entropy (nn.functional.cross_entropy or nn.NLLLoss), but you are using SGD while we are using Adam. Adam is different, but I didn't choose it, so I couldn't say why we preferred it; we can use it without any input parameters, so it is easy to set up, that would be my first argument.

Regarding exploding gradients: yes, we are considering this option. There are batch normalizations in the ResBlocks, but maybe they cannot cover all cases. Still, my question was mainly about ReLU vs. LeakyReLU, or rather why LeakyReLU works with KITTI and not with our data. I know you don't know our data, so it is not easy to answer.

TiagoCortinhal commented 3 years ago

I thought as much; I just wanted to make sure you were not talking about a loss with the same name that I didn't know of. Regarding the choice of optimizer, it is a matter of testing. In our case SGD with momentum showed the best results.
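For comparison, the two setups look roughly like this (the model and all hyperparameter values below are placeholders, not the exact ones from either codebase):

```python
import torch

model = torch.nn.Linear(8, 4)  # stand-in for the actual network

# SGD with momentum, the setup that worked best for us (placeholder values)
optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)

# Adam, the setup on your side; it adapts the step size per parameter,
# so a learning rate that is safe for SGD can still be too aggressive here
optimizer_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```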

Batch normalization was not designed to prevent exploding gradients (although it somewhat helps). Without being able to look at the data/gradients, my only idea is that in the LeakyReLU case the exploding gradients could be coming from the slope side (i.e. the negative side), assuming the exploding-gradients hypothesis holds.

You also said you are not doing any preprocessing. Does this include standardization of the values of the data points?
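For reference, a rough sketch of the per-channel standardization a KITTI-style parser applies to the projected range image before it enters the network (the statistics below are placeholders; the real values have to be computed on your own training set):

```python
import torch

# Placeholder per-channel statistics for a 5-channel input (range, x, y, z, remission)
sensor_img_means = torch.tensor([10.0, 0.0, 0.0, -1.0, 0.2])
sensor_img_stds = torch.tensor([10.0, 10.0, 10.0, 1.0, 0.2])

def standardize(proj: torch.Tensor, proj_mask: torch.Tensor) -> torch.Tensor:
    """proj: [5, H, W] projected scan; proj_mask: [H, W] bool mask of valid pixels."""
    proj = (proj - sensor_img_means[:, None, None]) / sensor_img_stds[:, None, None]
    # zero out pixels with no LiDAR return so they cannot inject NaNs or garbage
    return proj * proj_mask.unsqueeze(0).float()
```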

Best,