aim-uofa / AdelaiDepth

This repo contains the projects 'Virtual Normal', 'DiverseDepth', and '3D Scene Shape', which address monocular depth estimation and 3D scene reconstruction from a single image.

loss becomes nan #49

Open erzhu222 opened 2 years ago

erzhu222 commented 2 years ago

lib.utils.logging INFO: [Step 10470/182650] [Epoch 2/50] [multi] loss: nan, time: 5.862533, eta: 11 days, 16:23:31 meanstd-tanh_auxiloss: nan, meanstd-tanh_loss: nan, msg_normal_loss: nan, pairwise-normal-regress-edge_loss: nan, pairwise-normal-regress-plane_loss: nan, ranking-edge_auxiloss: nan, ranking-edge_loss: nan, abs_rel: 0.211080, whdr: 0.087764, group0_lr: 0.001000, group1_lr: 0.001000

Hi, when I train on the four datasets taskonomy, DiverseDepth, HRWSI, and Holopix50k, the loss becomes NaN. Did you encounter this problem during training? If so, how can it be solved? Thank you! Below are the arguments I passed in:

--backbone resnext101 \
--dataset_list taskonomy DiverseDepth HRWSI Holopix50k \
--batchsize 16 \
--base_lr 0.001 \
--use_tfboard \
--thread 8 \
--loss_mode _ranking-edge_pairwise-normal-regress-edge_msgil-normal_meanstd-tanh_pairwise-normal-regress-plane_ranking-edge-auximeanstd-tanh-auxi \
--epoch 50 \
--lr_scheduler_multiepochs 10 25 40 \
--val_step 5000 \
--snapshot_iters 5000 \
--log_interval 10 \

YvanYin commented 2 years ago

I didn't face this issue. You can clip your gradients to avoid it.
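
A minimal sketch of the suggested gradient clipping in a PyTorch training step, assuming a standard loop; the `model`, `optimizer`, `loss` names and the `max_norm` value are illustrative, not this repository's actual training code:

```python
import torch

def train_step(model, optimizer, loss):
    """Hypothetical training step showing where gradient clipping would go."""
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm (threshold is an assumption) so one bad
    # batch cannot blow up the weights and propagate NaNs to later steps.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
```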

erzhu222 commented 2 years ago

Thanks very much, I will try! However, I didn't change the code (the latest version); I only changed the batchsize and thread and trained with 8 NVIDIA V100 GPUs. What batchsize and thread settings did you use for training?

guangkaixu commented 2 years ago

A change of batchsize will not cause the loss to become NaN. I once faced the "loss nan" problem because of the crop operation: if the depth image becomes entirely invalid (all zeros) after cropping, the loss will be NaN. I will try to debug and fix it, but it may be time-consuming because it requires 8 NVIDIA V100 GPUs.

How many iterations had you trained before the loss became NaN? You can try clipping the gradients to avoid it, or wait for my debugging. Thank you!
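
A hedged sketch of the kind of guard described above: skip or re-sample a crop whose depth map has no valid pixels, so that losses are never computed on an all-invalid target. The function name and the fallback are illustrative assumptions, not code from this repository:

```python
import torch

def crop_is_usable(depth_crop: torch.Tensor, min_valid_pixels: int = 1) -> bool:
    """Return True if the cropped depth map still contains valid (non-zero) depth."""
    valid_mask = depth_crop > 0  # assumes invalid depth is encoded as 0
    return int(valid_mask.sum()) >= min_valid_pixels

# Illustrative usage inside a dataset's __getitem__:
# depth_crop = random_crop(depth)
# if not crop_is_usable(depth_crop):
#     depth_crop = center_crop(depth)  # fall back instead of returning an all-zero crop
```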

erzhu222 commented 2 years ago

Thanks for your reply! The loss became NaN after about 12,000 iterations (the 3rd epoch). I see the code you released already contains gradient clipping, but it does not seem to help.
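
If gradient clipping alone does not help, one common workaround (an assumption on my part, not something in the released code or this thread) is to detect a non-finite loss and skip the optimizer update for that batch so the weights are never corrupted:

```python
import torch

def safe_backward_step(loss, model, optimizer, max_norm=10.0):
    """Skip the update when the loss is NaN/Inf instead of corrupting the weights."""
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False  # caller can log the skipped batch and continue
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return True
```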