facebookresearch / DistDepth

Repository for "Toward Practical Monocular Indoor Depth Estimation" (CVPR 2022)

Why does the loss not fall? It keeps oscillating. #25

Closed 2579690686 closed 11 months ago

2579690686 commented 1 year ago

batch size = 6, thanks

choyingw commented 11 months ago

Hi, could you provide the loss curve or a screenshot, and the hyperparameters you set?

2579690686 commented 11 months ago

Thank you very much for your reply!

python execute.py --exe train --model_name distdepth-distilled --frame_ids 0 -1 1 --log_dir='./tmp' --data_path D:\sim --dataset SimSIN --batch_size 4 --width 256 --height 256 --max_depth 10.0 --num_epochs 10 --scheduler_step_size 8 --learning_rate 0.0001 --thre 0.95 --num_layers 152 --log_frequency 25

I tried adjusting the learning rate, but the loss still fluctuates.

Due to GPU memory limits, my batch size can only go up to 6. Will this affect training?
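(One common workaround when GPU memory caps the batch size is gradient accumulation, which emulates a larger effective batch. Below is a minimal PyTorch sketch; the model, optimizer, and loader are dummy stand-ins so the snippet runs on its own, not DistDepth's actual trainer.)

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the sketch runs on its own; in practice these would come
# from the trainer (hypothetical objects, not the repo's API).
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = [(torch.randn(6, 10), torch.randn(6, 1)) for _ in range(8)]

accum_steps = 4  # effective batch size = 6 * 4 = 24

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale before backward so the accumulated gradient is the mean over
    # accum_steps mini-batches rather than the sum.
    (loss / accum_steps).backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```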

2579690686 commented 11 months ago

epoch 0 | batch 0 | examples/s: 0.6 | loss: 0.51127 | time elapsed: 00h00m07s | time left: 00h00m00s
epoch 0 | batch 25 | examples/s: 8.4 | loss: 0.56632 | time elapsed: 00h00m36s | time left: 302h15m08s
epoch 0 | batch 50 | examples/s: 10.1 | loss: 0.64960 | time elapsed: 00h01m03s | time left: 267h42m16s
epoch 0 | batch 75 | examples/s: 10.0 | loss: 0.53189 | time elapsed: 00h01m31s | time left: 253h52m24s
epoch 0 | batch 100 | examples/s: 10.0 | loss: 0.59236 | time elapsed: 00h01m57s | time left: 246h08m43s
epoch 0 | batch 125 | examples/s: 9.8 | loss: 0.57938 | time elapsed: 00h02m24s | time left: 241h46m11s
epoch 0 | batch 150 | examples/s: 9.8 | loss: 0.58601 | time elapsed: 00h02m50s | time left: 238h07m57s
epoch 0 | batch 175 | examples/s: 9.6 | loss: 0.56136 | time elapsed: 00h03m17s | time left: 236h24m06s
epoch 0 | batch 200 | examples/s: 8.7 | loss: 0.57973 | time elapsed: 00h03m44s | time left: 234h45m47s
epoch 0 | batch 225 | examples/s: 9.9 | loss: 0.62872 | time elapsed: 00h04m10s | time left: 233h01m10s
epoch 0 | batch 250 | examples/s: 9.6 | loss: 0.51845 | time elapsed: 00h04m36s | time left: 231h22m07s
epoch 0 | batch 275 | examples/s: 9.7 | loss: 0.60010 | time elapsed: 00h05m01s | time left: 229h37m30s
epoch 0 | batch 300 | examples/s: 9.7 | loss: 0.60359 | time elapsed: 00h05m28s | time left: 228h42m39s
epoch 0 | batch 325 | examples/s: 9.7 | loss: 0.60311 | time elapsed: 00h05m54s | time left: 227h47m06s
epoch 0 | batch 350 | examples/s: 9.7 | loss: 0.61501 | time elapsed: 00h06m19s | time left: 226h44m33s
epoch 0 | batch 375 | examples/s: 9.6 | loss: 0.60289 | time elapsed: 00h06m45s | time left: 226h12m44s
epoch 0 | batch 400 | examples/s: 9.6 | loss: 0.59890 | time elapsed: 00h07m11s | time left: 225h39m54s
epoch 0 | batch 425 | examples/s: 9.9 | loss: 0.58350 | time elapsed: 00h07m37s | time left: 225h11m11s
epoch 0 | batch 450 | examples/s: 9.8 | loss: 0.63595 | time elapsed: 00h08m03s | time left: 224h47m30s
epoch 0 | batch 475 | examples/s: 9.7 | loss: 0.60160 | time elapsed: 00h08m29s | time left: 224h20m13s
epoch 0 | batch 500 | examples/s: 9.6 | loss: 0.60722 | time elapsed: 00h08m55s | time left: 223h55m25s
epoch 0 | batch 525 | examples/s: 9.1 | loss: 0.57381 | time elapsed: 00h09m22s | time left: 223h53m39s

Below are the results of my previous training; the loss has been oscillating in the 0.5-0.6 range.

tensor(0.5789, device='cuda:0', grad_fn=) 502305
tensor(0.5753, device='cuda:0', grad_fn=) 502306
tensor(0.5608, device='cuda:0', grad_fn=) 502307
tensor(0.6003, device='cuda:0', grad_fn=) 502308
tensor(0.5488, device='cuda:0', grad_fn=) 502309
tensor(0.5893, device='cuda:0', grad_fn=) 502310
tensor(0.5723, device='cuda:0', grad_fn=) 502311
tensor(0.5935, device='cuda:0', grad_fn=) 502312
tensor(0.5904, device='cuda:0', grad_fn=) 502313
tensor(0.5735, device='cuda:0', grad_fn=) 502314
tensor(0.5512, device='cuda:0', grad_fn=) 502315
tensor(0.5662, device='cuda:0', grad_fn=) 502316
tensor(0.5907, device='cuda:0', grad_fn=) 502317
tensor(0.5671, device='cuda:0', grad_fn=) 502318
tensor(0.5786, device='cuda:0', grad_fn=) 502319
tensor(0.5735, device='cuda:0', grad_fn=) 502320
tensor(0.5926, device='cuda:0', grad_fn=) 502321
tensor(0.5640, device='cuda:0', grad_fn=) 502322
tensor(0.5468, device='cuda:0', grad_fn=) 502323
tensor(0.6017, device='cuda:0', grad_fn=) 502324
tensor(0.5932, device='cuda:0', grad_fn=) 502325
tensor(0.5776, device='cuda:0', grad_fn=) 502326
tensor(0.5854, device='cuda:0', grad_fn=) 502327
tensor(0.5887, device='cuda:0', grad_fn=) 502328
tensor(0.5909, device='cuda:0', grad_fn=) 502329
tensor(0.5675, device='cuda:0', grad_fn=) 502330
tensor(0.5910, device='cuda:0', grad_fn=) 502331
tensor(0.6165, device='cuda:0', grad_fn=) 502332
tensor(0.5913, device='cuda:0', grad_fn=) 502333
tensor(0.5844, device='cuda:0', grad_fn=) 502334
tensor(0.5771, device='cuda:0', grad_fn=) 502335
tensor(0.5866, device='cuda:0', grad_fn=) 502336
tensor(0.5686, device='cuda:0', grad_fn=) 502337
tensor(0.5570, device='cuda:0', grad_fn=)

2579690686 commented 11 months ago

epoch 0 | batch 0 | examples/s: 0.6 | loss: 0.51127 | time elapsed: 00h00m07s | time left: 00h00m00s
epoch 0 | batch 25 | examples/s: 8.4 | loss: 0.56632 | time elapsed: 00h00m36s | time left: 302h15m08s
epoch 0 | batch 50 | examples/s: 10.1 | loss: 0.64960 | time elapsed: 00h01m03s | time left: 267h42m16s
epoch 0 | batch 75 | examples/s: 10.0 | loss: 0.53189 | time elapsed: 00h01m31s | time left: 253h52m24s
epoch 0 | batch 100 | examples/s: 10.0 | loss: 0.59236 | time elapsed: 00h01m57s | time left: 246h08m43s
epoch 0 | batch 125 | examples/s: 9.8 | loss: 0.57938 | time elapsed: 00h02m24s | time left: 241h46m11s
epoch 0 | batch 150 | examples/s: 9.8 | loss: 0.58601 | time elapsed: 00h02m50s | time left: 238h07m57s
epoch 0 | batch 175 | examples/s: 9.6 | loss: 0.56136 | time elapsed: 00h03m17s | time left: 236h24m06s
epoch 0 | batch 200 | examples/s: 8.7 | loss: 0.57973 | time elapsed: 00h03m44s | time left: 234h45m47s
epoch 0 | batch 225 | examples/s: 9.9 | loss: 0.62872 | time elapsed: 00h04m10s | time left: 233h01m10s
epoch 0 | batch 250 | examples/s: 9.6 | loss: 0.51845 | time elapsed: 00h04m36s | time left: 231h22m07s
epoch 0 | batch 275 | examples/s: 9.7 | loss: 0.60010 | time elapsed: 00h05m01s | time left: 229h37m30s
epoch 0 | batch 300 | examples/s: 9.7 | loss: 0.60359 | time elapsed: 00h05m28s | time left: 228h42m39s
epoch 0 | batch 325 | examples/s: 9.7 | loss: 0.60311 | time elapsed: 00h05m54s | time left: 227h47m06s
epoch 0 | batch 350 | examples/s: 9.7 | loss: 0.61501 | time elapsed: 00h06m19s | time left: 226h44m33s
epoch 0 | batch 375 | examples/s: 9.6 | loss: 0.60289 | time elapsed: 00h06m45s | time left: 226h12m44s
epoch 0 | batch 400 | examples/s: 9.6 | loss: 0.59890 | time elapsed: 00h07m11s | time left: 225h39m54s
epoch 0 | batch 425 | examples/s: 9.9 | loss: 0.58350 | time elapsed: 00h07m37s | time left: 225h11m11s
epoch 0 | batch 450 | examples/s: 9.8 | loss: 0.63595 | time elapsed: 00h08m03s | time left: 224h47m30s
epoch 0 | batch 475 | examples/s: 9.7 | loss: 0.60160 | time elapsed: 00h08m29s | time left: 224h20m13s
epoch 0 | batch 500 | examples/s: 9.6 | loss: 0.60722 | time elapsed: 00h08m55s | time left: 223h55m25s
epoch 0 | batch 525 | examples/s: 9.1 | loss: 0.57381 | time elapsed: 00h09m22s | time left: 223h53m39s
epoch 0 | batch 550 | examples/s: 10.0 | loss: 0.61983 | time elapsed: 00h09m48s | time left: 223h49m04s
epoch 0 | batch 575 | examples/s: 9.4 | loss: 0.54653 | time elapsed: 00h10m14s | time left: 223h24m32s
epoch 0 | batch 600 | examples/s: 9.9 | loss: 0.61790 | time elapsed: 00h10m41s | time left: 223h24m00s
epoch 0 | batch 625 | examples/s: 9.8 | loss: 0.55278 | time elapsed: 00h11m07s | time left: 223h07m26s
epoch 0 | batch 650 | examples/s: 9.9 | loss: 0.62031 | time elapsed: 00h11m34s | time left: 223h12m23s
epoch 0 | batch 675 | examples/s: 9.7 | loss: 0.53466 | time elapsed: 00h11m59s | time left: 222h46m30s
epoch 0 | batch 700 | examples/s: 9.8 | loss: 0.56861 | time elapsed: 00h12m25s | time left: 222h31m33s
epoch 0 | batch 725 | examples/s: 9.9 | loss: 0.55777 | time elapsed: 00h12m51s | time left: 222h23m08s
epoch 0 | batch 750 | examples/s: 9.9 | loss: 0.59033 | time elapsed: 00h13m17s | time left: 222h16m51s
epoch 0 | batch 775 | examples/s: 9.8 | loss: 0.59776 | time elapsed: 00h13m44s | time left: 222h20m22s
epoch 0 | batch 800 | examples/s: 9.8 | loss: 0.57714 | time elapsed: 00h14m09s | time left: 221h53m25s
epoch 0 | batch 825 | examples/s: 9.9 | loss: 0.57735 | time elapsed: 00h14m34s | time left: 221h36m53s
epoch 0 | batch 850 | examples/s: 9.9 | loss: 0.55686 | time elapsed: 00h15m00s | time left: 221h21m58s
epoch 0 | batch 875 | examples/s: 9.7 | loss: 0.58286 | time elapsed: 00h15m27s | time left: 221h27m06s
epoch 0 | batch 900 | examples/s: 9.9 | loss: 0.57009 | time elapsed: 00h15m54s | time left: 221h32m48s
epoch 0 | batch 925 | examples/s: 9.9 | loss: 0.62446 | time elapsed: 00h16m19s | time left: 221h18m11s
epoch 0 | batch 950 | examples/s: 9.8 | loss: 0.57562 | time elapsed: 00h16m45s | time left: 221h04m40s
epoch 0 | batch 975 | examples/s: 9.8 | loss: 0.59325 | time elapsed: 00h17m11s | time left: 220h56m19s
epoch 0 | batch 1000 | examples/s: 9.7 | loss: 0.57069 | time elapsed: 00h17m37s | time left: 221h01m29s
epoch 0 | batch 1025 | examples/s: 10.0 | loss: 0.56339 | time elapsed: 00h18m03s | time left: 220h54m44s
epoch 0 | batch 1050 | examples/s: 9.7 | loss: 0.51971 | time elapsed: 00h18m29s | time left: 220h50m31s
epoch 0 | batch 1075 | examples/s: 9.7 | loss: 0.58701 | time elapsed: 00h18m56s | time left: 220h55m42s
epoch 0 | batch 1100 | examples/s: 9.7 | loss: 0.62540 | time elapsed: 00h19m23s | time left: 220h53m58s
epoch 0 | batch 1125 | examples/s: 9.9 | loss: 0.56879 | time elapsed: 00h19m49s | time left: 220h46m35s
epoch 0 | batch 1150 | examples/s: 9.9 | loss: 0.64508 | time elapsed: 00h20m15s | time left: 220h51m45s
epoch 0 | batch 1175 | examples/s: 9.7 | loss: 0.62032 | time elapsed: 00h20m42s | time left: 220h47m26s
epoch 0 | batch 1200 | examples/s: 9.8 | loss: 0.61268 | time elapsed: 00h21m07s | time left: 220h35m09s
epoch 0 | batch 1225 | examples/s: 9.7 | loss: 0.54779 | time elapsed: 00h21m32s | time left: 220h21m37s
epoch 0 | batch 1250 | examples/s: 9.8 | loss: 0.58781 | time elapsed: 00h21m58s | time left: 220h15m46s
epoch 0 | batch 1275 | examples/s: 9.8 | loss: 0.53676 | time elapsed: 00h22m24s | time left: 220h09m56s
epoch 0 | batch 1300 | examples/s: 9.7 | loss: 0.58683 | time elapsed: 00h22m49s | time left: 220h00m25s
epoch 0 | batch 1325 | examples/s: 9.8 | loss: 0.57288 | time elapsed: 00h23m14s | time left: 219h45m54s
epoch 0 | batch 1350 | examples/s: 9.9 | loss: 0.58358 | time elapsed: 00h23m39s | time left: 219h33m02s
epoch 0 | batch 1375 | examples/s: 9.8 | loss: 0.55579 | time elapsed: 00h24m05s | time left: 219h28m57s
epoch 0 | batch 1400 | examples/s: 9.8 | loss: 0.56870 | time elapsed: 00h24m31s | time left: 219h25m50s
epoch 0 | batch 1425 | examples/s: 9.9 | loss: 0.57267 | time elapsed: 00h24m56s | time left: 219h21m08s
epoch 0 | batch 1450 | examples/s: 9.8 | loss: 0.55499 | time elapsed: 00h25m22s | time left: 219h11m23s
epoch 0 | batch 1475 | examples/s: 9.8 | loss: 0.59987 | time elapsed: 00h25m46s | time left: 218h54m29s
epoch 0 | batch 1500 | examples/s: 9.8 | loss: 0.60035 | time elapsed: 00h26m11s | time left: 218h45m07s
epoch 0 | batch 1525 | examples/s: 9.9 | loss: 0.54744 | time elapsed: 00h26m37s | time left: 218h42m20s
epoch 0 | batch 1550 | examples/s: 10.1 | loss: 0.57234 | time elapsed: 00h27m03s | time left: 218h43m45s
epoch 0 | batch 1575 | examples/s: 9.8 | loss: 0.63796 | time elapsed: 00h27m29s | time left: 218h38m49s
epoch 0 | batch 1600 | examples/s: 9.9 | loss: 0.57050 | time elapsed: 00h27m56s | time left: 218h44m15s
epoch 0 | batch 1625 | examples/s: 9.7 | loss: 0.57792 | time elapsed: 00h28m22s | time left: 218h38m28s
epoch 0 | batch 1650 | examples/s: 9.8 | loss: 0.56862 | time elapsed: 00h28m48s | time left: 218h43m03s
epoch 0 | batch 1675 | examples/s: 9.7 | loss: 0.59931 | time elapsed: 00h29m15s | time left: 218h43m42s
epoch 0 | batch 1700 | examples/s: 9.9 | loss: 0.61965 | time elapsed: 00h29m42s | time left: 218h48m06s
epoch 0 | batch 1725 | examples/s: 10.0 | loss: 0.61029 | time elapsed: 00h30m08s | time left: 218h46m26s
epoch 0 | batch 1750 | examples/s: 9.7 | loss: 0.57544 | time elapsed: 00h30m34s | time left: 218h48m04s
epoch 0 | batch 1775 | examples/s: 9.9 | loss: 0.55535 | time elapsed: 00h31m00s | time left: 218h47m14s
epoch 0 | batch 1800 | examples/s: 9.9 | loss: 0.59221 | time elapsed: 00h31m28s | time left: 218h57m40s
epoch 0 | batch 1825 | examples/s: 9.8 | loss: 0.56692 | time elapsed: 00h31m55s | time left: 219h05m25s
epoch 0 | batch 1850 | examples/s: 9.7 | loss: 0.58602 | time elapsed: 00h32m22s | time left: 219h09m53s
epoch 0 | batch 1875 | examples/s: 10.0 | loss: 0.60834 | time elapsed: 00h32m49s | time left: 219h09m43s
epoch 0 | batch 1900 | examples/s: 10.1 | loss: 0.54720 | time elapsed: 00h33m15s | time left: 219h10m36s
epoch 0 | batch 1925 | examples/s: 9.7 | loss: 0.60159 | time elapsed: 00h33m42s | time left: 219h10m48s
epoch 0 | batch 1950 | examples/s: 9.8 | loss: 0.55886 | time elapsed: 00h34m08s | time left: 219h09m56s
epoch 0 | batch 1975 | examples/s: 9.7 | loss: 0.58505 | time elapsed: 00h34m34s | time left: 219h10m22s
epoch 0 | batch 2000 | examples/s: 9.8 | loss: 0.57191 | time elapsed: 00h35m01s | time left: 219h13m20s
epoch 0 | batch 4000 | examples/s: 9.8 | loss: 0.60531 | time elapsed: 01h06m13s | time left: 206h43m55s
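(To turn console output like the above into the loss curve requested earlier, one could parse the `loss:` fields; a minimal sketch, assuming the log was saved to a file named train.log, a hypothetical path.)

```python
import re
import matplotlib.pyplot as plt

# Parse "batch N | examples/s: ... | loss: X" entries from the console log.
# The regex is tailored to the exact log format shown in this thread.
with open("train.log") as f:
    pairs = re.findall(r"batch (\d+) \| examples/s: [\d.]+ \| loss: ([\d.]+)", f.read())

batches = [int(b) for b, _ in pairs]
losses = [float(l) for _, l in pairs]

plt.plot(batches, losses)
plt.xlabel("batch")
plt.ylabel("training loss")
plt.savefig("loss_curve.png")
```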

choyingw commented 11 months ago

Thank you for reporting this issue. It was a stabilization issue in the distillation loss. I have fixed it and updated the code. You should be able to see some false-colored depth even from the first-epoch model (just as a sanity check). It should work with batch size = 4, though I found the loss may fluctuate more at smaller batch sizes.
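(For the first-epoch sanity check mentioned above, a predicted depth map can be rendered as a false-color image; a minimal matplotlib sketch, with random data standing in for the network output.)

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder prediction: random values in the 0-10 m range from the training
# command above, just so the snippet runs on its own.
depth = np.random.uniform(0.1, 10.0, (256, 256))

# imsave rescales the array to [0, 1] and applies the colormap before writing.
plt.imsave("depth_vis.png", depth, cmap="magma")
```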

2579690686 commented 11 months ago

Thank you very much, the model has now converged.

Thanks again.