Training issue - Githubissues

janicepan commented 5 years ago

Has anyone dealt with the model only predicting 1s? I cannot train this model as it is written on the NYU dataset. The validation predictions that are generated while training just come out as constants.

angshine commented 5 years ago

I think a few settings can cause this situation:

1. The alpha and beta used to encode and decode the SID are not set properly, so the discretization puts too many bins on the smaller distances (e.g. 0~10 m of the 80 m range for KITTI).
2. Some layers' outputs might be NaN, which is mostly caused by improper BatchNorm parameters.
3. If you are trying to overfit a small part of the dataset, setting all dropout layers to eval mode (or deleting them) might help.
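To make the alpha/beta point concrete, here is a minimal sketch of spacing-increasing discretization (SID) bin edges, using the formula from the DORN paper, t_i = exp(log(alpha) + log(beta / alpha) * i / K). The function names are hypothetical, and this repository's own encode/decode code may differ (for example by applying a shift to the depth range), so treat it as an illustration only.

```python
import math

def sid_thresholds(alpha, beta, K):
    """Return the K + 1 SID bin edges t_0..t_K for the depth range [alpha, beta]."""
    if not (0 < alpha < beta) or K < 1:
        raise ValueError("need 0 < alpha < beta and K >= 1")
    log_ratio = math.log(beta / alpha)
    return [math.exp(math.log(alpha) + log_ratio * i / K) for i in range(K + 1)]

def fraction_of_bins_below(edges, depth):
    """Fraction of the K bins whose upper edge lies at or below `depth`."""
    num_bins = len(edges) - 1
    return sum(1 for t in edges[1:] if t <= depth) / num_bins

# Same 80 m range and K, but two different choices of alpha (values are illustrative).
edges_tiny_alpha = sid_thresholds(alpha=0.001, beta=80.0, K=80)
edges_unit_alpha = sid_thresholds(alpha=1.0, beta=80.0, K=80)

share_tiny = fraction_of_bins_below(edges_tiny_alpha, 10.0)
share_unit = fraction_of_bins_below(edges_unit_alpha, 10.0)
print(f"bins below 10 m: alpha=0.001 -> {share_tiny:.2f}, alpha=1.0 -> {share_unit:.2f}")
```

Because the edges are spaced evenly in log-depth, a very small alpha pushes most of the bins into the first few meters, which is the "too many bins on the smaller distances" situation described above.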

janicepan commented 5 years ago

@angshine Thank you for the reply! I appreciate the suggestions! Were you able to successfully train it on either KITTI or NYU?

I tried deleting the dropout steps, changing the alpha and beta, and changing K, but I am still getting a constant output.

angshine commented 5 years ago

@janicepan I can train it on KITTI but haven't tried NYU. I ran into the constant-output problem while trying to reproduce the results, and I think my problem was improperly set alpha and beta. Are you training from scratch or using weights pretrained on ImageNet?

janicepan commented 5 years ago

@angshine I am using the pretrained ResNet weights. I did adjust the alpha and beta based on the min and max range values. Is that correct? How did you find the proper values?

angshine commented 5 years ago

@janicepan Have you checked each layer's output to see whether it contains any NaN values? When I debugged this error, I found that the outputs of the first few backbone layers contained lots of NaNs because the pretrained weights I was using were incorrect.
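One way to run that per-layer check in PyTorch is with forward hooks. The sketch below is an assumption about how such a check could look, not code from this repository: `find_nan_layers` is a hypothetical helper, and the toy model with a deliberately corrupted BatchNorm buffer only stands in for a backbone loaded with bad weights.

```python
import torch
import torch.nn as nn

def find_nan_layers(model, sample):
    """Run one forward pass and return the names of submodules whose output contains NaN."""
    bad_layers = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Only plain tensor outputs are checked in this sketch.
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                bad_layers.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(sample)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return bad_layers

# Toy stand-in for a backbone: corrupt the BatchNorm running variance on purpose.
toy = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.BatchNorm2d(4), nn.ReLU())
toy.eval()  # eval mode makes BatchNorm use its stored running statistics
toy[1].running_var.fill_(float("nan"))

nan_layers = find_nan_layers(toy, torch.randn(1, 3, 8, 8))
print("first layer producing NaN:", nan_layers[0])
```

Hooks fire in execution order, so the first name in the returned list is the earliest layer to produce NaN, which is usually the one worth inspecting.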

janicepan commented 5 years ago

Thanks again @angshine for the suggestion to check the layer outputs. Through my tests, I didn't find any nan outputs, but I am still unable to train it. Is the default small batch size of 6 working for you? With such a small batch size, I found that I need to use a very very small learning rate in order for the network to not converge immediately to outputting constant images, and the results don't end up looking like anything. I also cannot use a larger batch size, because I run into memory issues with how large the network is. Did you (or anyone else who might come across this post) run into similar issues?

[Image: comparison_best]

LCJHust commented 4 years ago

Hi, I ran into the same problem when training on KITTI. Just as you described, it converges very quickly and outputs constant images. Have you solved it? Thank you.

JingweiZhang12 commented 4 years ago

@janicepan @LCJHust I ran into the same issue. Have you solved it? I trained the model on the NYUv2 dataset and the output of the model is constant. So weird! Here is the picture: comparison_best

The predicted depth values:

dontLoveBugs commented 4 years ago

Hi everyone, I have updated the implementation of DORN and fixed the constant-output problem.

kk6398 commented 1 year ago

Hey guys, I have run into a difficult problem: I set the '--c' parameter to the path of 'resnet101_v1c.pth' and tried to run train.py, but I get the following error: ruamel.yaml.reader.ReaderError: unacceptable character #x0080: invalid start byte. Could you help me with this? Is this step correct? Thanks a lot.
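The error above is what a YAML reader tends to report when it is handed a binary file instead of text: pickle streams (protocol 2 and later) begin with the byte 0x80, which is not a valid UTF-8 start byte. Whether that is what happened here is an assumption, since the thread does not say what the parameter expects. As a rough sketch, a hypothetical helper like `looks_like_text_config` could check a path before it is passed to a YAML loader:

```python
import codecs

def looks_like_text_config(path, probe_bytes=1024):
    """Return True if the start of the file decodes as UTF-8 text with no NUL bytes."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    if b"\x00" in head:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the probe boundary.
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A checkpoint such as a `.pth` file would normally fail this check, while a YAML config file would pass it.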

LCJHust commented 1 year ago

This is an automatic vacation reply from QQ Mail. Hello, I have received your email and will reply as soon as possible.

dontLoveBugs / DORN_pytorch

Training issue #17