ClementPinard / SfmLearner-Pytorch

Pytorch version of SfmLearner from Tinghui Zhou et al.
MIT License

Reconstruction loss as NaN #17

Closed asprasan closed 6 years ago

asprasan commented 6 years ago

While going through the function that computes the photometric reconstruction loss, I found this line of code: `assert((reconstruction_loss == reconstruction_loss).data[0] == 1)`. I figured out that this line checks whether the reconstruction loss is NaN, but I couldn't quite figure out why we are doing this. Under what circumstances could the reconstruction loss become NaN?
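For reference, this is how I read that line, written out more explicitly (just a sketch, `assert_no_nan` is my own illustrative name, not code from the repository):

```python
import torch

# Minimal sketch: the assert relies on the fact that NaN != NaN, so
# (loss == loss) is False exactly at NaN entries. A more explicit
# equivalent in current PyTorch:
def assert_no_nan(loss, name="reconstruction loss"):
    assert not torch.isnan(loss).any().item(), "{} contains NaN".format(name)
```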

I'm trying to train only the pose network. I'm using the depth map obtained from a Kinect camera in place of training the DispNet. I'm getting the assertion error randomly at run time and am unable to figure out the cause. I want to know why we are checking for NaNs in the reconstruction loss. Also, what are the possible causes for the reconstruction loss to become NaN?

ClementPinard commented 6 years ago

A NaN reconstruction loss causes all the gradients to be NaN. Even if only one value in the whole diff map is NaN, you will get NaN at the next optimizer step, so you really want to avoid that!

This line is there to help you figure out what goes wrong if you get a NaN training loss, since as soon as the loss becomes NaN, your network is basically bound to output NaN until the end of training.

How it got to be NaN depends on your problem. I advise you to find a seed on which it appears every time and track down where the first NaN shows up.

My guess is that it happens when computing the u,v coordinates here: https://github.com/ClementPinard/SfmLearner-Pytorch/blob/master/inverse_warp.py#L65 since we divide by the Z value, and when it's 0 you get NaN.
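Roughly, the operation at that line looks like this (a sketch with illustrative names, not the actual repository code):

```python
import torch

# Sketch of the projection step referred to above: pixel coordinates come from
# dividing the camera-frame X and Y by Z, so a zero Z can poison the warped
# image and then the loss.
def project_to_pixel_coords(cam_coords, eps=1e-3):
    X = cam_coords[:, 0]
    Y = cam_coords[:, 1]
    Z = cam_coords[:, 2].clamp(min=eps)  # guard against division by zero
    u = X / Z
    v = Y / Z
    return u, v
```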

asprasan commented 6 years ago

I do understand that once the loss goes to NaN, the whole training becomes pointless: the gradients go to NaN and there's no way of coming back.

However, I was intrigued by the fact that the NaN check is done only for the photometric reconstruction loss; the other loss functions are not checked. So I was wondering whether you encountered any particular scenario in which the photometric reconstruction loss became NaN?

The Z value in "inverse_warp.py" is clamped to 1e-3, so I don't see how a division by zero could occur. The Kinect depth that I'm using has a lot of zeros; if division by zero were the issue, it should have happened at every iteration. Could this be due to some overflow error?

ClementPinard commented 6 years ago

The other loss functions are actually much simpler, since their target value is fixed (smooth loss and explainability loss), but you can check them for NaN too.

You can also try discarding the 0 values in your Kinect depth, because a zero depth can produce a very large warp displacement for a given translation.

The other potential source of NaN is the Adam optimizer, which has a second-order term that can diverge if your learning rate is too high. You should also check the weight values after each optimizer step.
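Something like this, run right after `optimizer.step()`, would catch it early (a sketch, `check_weights` is just an illustrative name):

```python
import torch

# Sketch of the suggested sanity check: raise as soon as any parameter
# becomes NaN after an optimizer step.
def check_weights(model):
    for name, param in model.named_parameters():
        if torch.isnan(param).any():
            raise RuntimeError("NaN in parameter {} after optimizer step".format(name))
```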

asprasan commented 6 years ago

The 0 values in the Kinect depth occur either because objects are very far away or because they don't reflect the projected IR light. Currently I'm not handling that in the warping part; however, I'm setting the photometric loss to zero at the pixels where the depth is zero.
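Concretely, the masking looks roughly like this (a sketch with illustrative names, assuming the per-pixel photometric difference and the Kinect depth broadcast to the same shape; I average over the valid pixels only):

```python
import torch

# Zero out the photometric difference wherever the measured depth is zero,
# and average over the valid pixels only.
def masked_photometric_loss(diff, depth):
    valid = (depth > 0).float()
    return (diff.abs() * valid).sum() / valid.sum().clamp(min=1)
```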

I have faced issues with the Adam optimizer before. However, that may not be the case here, because only the photometric loss goes to NaN; the other losses all stay within a reasonable range.

I will have a more careful look at the code and try to log things properly whenever the loss becomes NaN.
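One helper worth noting for this kind of debugging (available in more recent PyTorch versions) is autograd anomaly detection:

```python
import torch

# Anomaly detection reports which backward operation first produced a NaN,
# which makes this kind of logging much easier.
torch.autograd.set_detect_anomaly(True)
```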

versatran01 commented 6 years ago

I tried to reimplement this and also got NaNs in the photometric reconstruction loss. It only happened in the monocular case, not the stereo one. It is very annoying that training suddenly dies. I haven't tried clamping the depth yet; hopefully that will fix it.

anuragranj commented 6 years ago

Well, the depth computation is monocular here. What do you mean by the stereo case? Are the NaNs because of zero depth?

asprasan commented 6 years ago

It's definitely not because of zero depth: the depth is clamped to 0.001 before the division. One reason could be some overflow/underflow of the numbers.

In the LSD-SLAM paper, the authors rescale the inverse depth so that its mean is 1 at every iteration. Maybe that could solve the problem.
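As a sketch of that idea (illustrative names, not code from the paper or this repository):

```python
import torch

# Rescale the depth so that the mean inverse depth equals 1: if m = mean(1/depth),
# then 1/(depth * m) has mean m/m = 1.
def normalize_depth(depth, eps=1e-3):
    inv_depth = 1.0 / depth.clamp(min=eps)
    return depth * inv_depth.mean()
```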

It's definitely very annoying to see that the training stops abruptly and we can't figure out what's wrong.

versatran01 commented 6 years ago

I was talking about my own implementation, apologies for the confusion. Originally I did not clamp the depth to a nonzero value and I got NaNs, but after clamping it never happened again.