Closed 11 months ago
Hi, could you provide the loss curve (or a screenshot) and the hyperparameters you set?
Thank you very much for your reply!
python execute.py --exe train --model_name distdepth-distilled --frame_ids 0 -1 1 --log_dir='./tmp' --data_path D:\sim --dataset SimSIN --batch_size 4 --width 256 --height 256 --max_depth 10.0 --num_epochs 10 --scheduler_step_size 8 --learning_rate 0.0001 --thre 0.95 --num_layers 152 --log_frequency 25
I tried adjusting the learning rate, but the loss still fluctuates slightly.
Due to GPU memory limitations, my batch size can only go up to 6; will this affect training?
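If GPU memory is the only thing capping the batch size, one generic workaround is gradient accumulation, which steps the optimizer only every few batches to mimic a larger effective batch. The sketch below is illustrative PyTorch only (toy model, made-up names), not part of the DistDepth training code:

```python
import torch
import torch.nn as nn

# Toy gradient-accumulation sketch (not DistDepth code): a physical batch of 4
# stepped every 4 iterations behaves roughly like an effective batch of 16.
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
accum_steps = 4

optimizer.zero_grad()
for i in range(100):
    x = torch.randn(4, 16)           # physical batch of 4
    y = torch.randn(4, 1)
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average over the effective batch
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```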
Below are the results of my previous training, and the loss has been oscillating in the range of 0.5-0.6.
tensor(0.5789, device='cuda:0', grad_fn=
epoch 0 | batch 0 | examples/s: 0.6 | loss: 0.51127 | time elapsed: 00h00m07s | time left: 00h00m00s
epoch 0 | batch 25 | examples/s: 8.4 | loss: 0.56632 | time elapsed: 00h00m36s | time left: 302h15m08s
epoch 0 | batch 50 | examples/s: 10.1 | loss: 0.64960 | time elapsed: 00h01m03s | time left: 267h42m16s
epoch 0 | batch 75 | examples/s: 10.0 | loss: 0.53189 | time elapsed: 00h01m31s | time left: 253h52m24s
epoch 0 | batch 100 | examples/s: 10.0 | loss: 0.59236 | time elapsed: 00h01m57s | time left: 246h08m43s
epoch 0 | batch 125 | examples/s: 9.8 | loss: 0.57938 | time elapsed: 00h02m24s | time left: 241h46m11s
epoch 0 | batch 150 | examples/s: 9.8 | loss: 0.58601 | time elapsed: 00h02m50s | time left: 238h07m57s
epoch 0 | batch 175 | examples/s: 9.6 | loss: 0.56136 | time elapsed: 00h03m17s | time left: 236h24m06s
epoch 0 | batch 200 | examples/s: 8.7 | loss: 0.57973 | time elapsed: 00h03m44s | time left: 234h45m47s
epoch 0 | batch 225 | examples/s: 9.9 | loss: 0.62872 | time elapsed: 00h04m10s | time left: 233h01m10s
epoch 0 | batch 250 | examples/s: 9.6 | loss: 0.51845 | time elapsed: 00h04m36s | time left: 231h22m07s
epoch 0 | batch 275 | examples/s: 9.7 | loss: 0.60010 | time elapsed: 00h05m01s | time left: 229h37m30s
epoch 0 | batch 300 | examples/s: 9.7 | loss: 0.60359 | time elapsed: 00h05m28s | time left: 228h42m39s
epoch 0 | batch 325 | examples/s: 9.7 | loss: 0.60311 | time elapsed: 00h05m54s | time left: 227h47m06s
epoch 0 | batch 350 | examples/s: 9.7 | loss: 0.61501 | time elapsed: 00h06m19s | time left: 226h44m33s
epoch 0 | batch 375 | examples/s: 9.6 | loss: 0.60289 | time elapsed: 00h06m45s | time left: 226h12m44s
epoch 0 | batch 400 | examples/s: 9.6 | loss: 0.59890 | time elapsed: 00h07m11s | time left: 225h39m54s
epoch 0 | batch 425 | examples/s: 9.9 | loss: 0.58350 | time elapsed: 00h07m37s | time left: 225h11m11s
epoch 0 | batch 450 | examples/s: 9.8 | loss: 0.63595 | time elapsed: 00h08m03s | time left: 224h47m30s
epoch 0 | batch 475 | examples/s: 9.7 | loss: 0.60160 | time elapsed: 00h08m29s | time left: 224h20m13s
epoch 0 | batch 500 | examples/s: 9.6 | loss: 0.60722 | time elapsed: 00h08m55s | time left: 223h55m25s
epoch 0 | batch 525 | examples/s: 9.1 | loss: 0.57381 | time elapsed: 00h09m22s | time left: 223h53m39s
epoch 0 | batch 550 | examples/s: 10.0 | loss: 0.61983 | time elapsed: 00h09m48s | time left: 223h49m04s
epoch 0 | batch 575 | examples/s: 9.4 | loss: 0.54653 | time elapsed: 00h10m14s | time left: 223h24m32s
epoch 0 | batch 600 | examples/s: 9.9 | loss: 0.61790 | time elapsed: 00h10m41s | time left: 223h24m00s
epoch 0 | batch 625 | examples/s: 9.8 | loss: 0.55278 | time elapsed: 00h11m07s | time left: 223h07m26s
epoch 0 | batch 650 | examples/s: 9.9 | loss: 0.62031 | time elapsed: 00h11m34s | time left: 223h12m23s
epoch 0 | batch 675 | examples/s: 9.7 | loss: 0.53466 | time elapsed: 00h11m59s | time left: 222h46m30s
epoch 0 | batch 700 | examples/s: 9.8 | loss: 0.56861 | time elapsed: 00h12m25s | time left: 222h31m33s
epoch 0 | batch 725 | examples/s: 9.9 | loss: 0.55777 | time elapsed: 00h12m51s | time left: 222h23m08s
epoch 0 | batch 750 | examples/s: 9.9 | loss: 0.59033 | time elapsed: 00h13m17s | time left: 222h16m51s
epoch 0 | batch 775 | examples/s: 9.8 | loss: 0.59776 | time elapsed: 00h13m44s | time left: 222h20m22s
epoch 0 | batch 800 | examples/s: 9.8 | loss: 0.57714 | time elapsed: 00h14m09s | time left: 221h53m25s
epoch 0 | batch 825 | examples/s: 9.9 | loss: 0.57735 | time elapsed: 00h14m34s | time left: 221h36m53s
epoch 0 | batch 850 | examples/s: 9.9 | loss: 0.55686 | time elapsed: 00h15m00s | time left: 221h21m58s
epoch 0 | batch 875 | examples/s: 9.7 | loss: 0.58286 | time elapsed: 00h15m27s | time left: 221h27m06s
epoch 0 | batch 900 | examples/s: 9.9 | loss: 0.57009 | time elapsed: 00h15m54s | time left: 221h32m48s
epoch 0 | batch 925 | examples/s: 9.9 | loss: 0.62446 | time elapsed: 00h16m19s | time left: 221h18m11s
epoch 0 | batch 950 | examples/s: 9.8 | loss: 0.57562 | time elapsed: 00h16m45s | time left: 221h04m40s
epoch 0 | batch 975 | examples/s: 9.8 | loss: 0.59325 | time elapsed: 00h17m11s | time left: 220h56m19s
epoch 0 | batch 1000 | examples/s: 9.7 | loss: 0.57069 | time elapsed: 00h17m37s | time left: 221h01m29s
epoch 0 | batch 1025 | examples/s: 10.0 | loss: 0.56339 | time elapsed: 00h18m03s | time left: 220h54m44s
epoch 0 | batch 1050 | examples/s: 9.7 | loss: 0.51971 | time elapsed: 00h18m29s | time left: 220h50m31s
epoch 0 | batch 1075 | examples/s: 9.7 | loss: 0.58701 | time elapsed: 00h18m56s | time left: 220h55m42s
epoch 0 | batch 1100 | examples/s: 9.7 | loss: 0.62540 | time elapsed: 00h19m23s | time left: 220h53m58s
epoch 0 | batch 1125 | examples/s: 9.9 | loss: 0.56879 | time elapsed: 00h19m49s | time left: 220h46m35s
epoch 0 | batch 1150 | examples/s: 9.9 | loss: 0.64508 | time elapsed: 00h20m15s | time left: 220h51m45s
epoch 0 | batch 1175 | examples/s: 9.7 | loss: 0.62032 | time elapsed: 00h20m42s | time left: 220h47m26s
epoch 0 | batch 1200 | examples/s: 9.8 | loss: 0.61268 | time elapsed: 00h21m07s | time left: 220h35m09s
epoch 0 | batch 1225 | examples/s: 9.7 | loss: 0.54779 | time elapsed: 00h21m32s | time left: 220h21m37s
epoch 0 | batch 1250 | examples/s: 9.8 | loss: 0.58781 | time elapsed: 00h21m58s | time left: 220h15m46s
epoch 0 | batch 1275 | examples/s: 9.8 | loss: 0.53676 | time elapsed: 00h22m24s | time left: 220h09m56s
epoch 0 | batch 1300 | examples/s: 9.7 | loss: 0.58683 | time elapsed: 00h22m49s | time left: 220h00m25s
epoch 0 | batch 1325 | examples/s: 9.8 | loss: 0.57288 | time elapsed: 00h23m14s | time left: 219h45m54s
epoch 0 | batch 1350 | examples/s: 9.9 | loss: 0.58358 | time elapsed: 00h23m39s | time left: 219h33m02s
epoch 0 | batch 1375 | examples/s: 9.8 | loss: 0.55579 | time elapsed: 00h24m05s | time left: 219h28m57s
epoch 0 | batch 1400 | examples/s: 9.8 | loss: 0.56870 | time elapsed: 00h24m31s | time left: 219h25m50s
epoch 0 | batch 1425 | examples/s: 9.9 | loss: 0.57267 | time elapsed: 00h24m56s | time left: 219h21m08s
epoch 0 | batch 1450 | examples/s: 9.8 | loss: 0.55499 | time elapsed: 00h25m22s | time left: 219h11m23s
epoch 0 | batch 1475 | examples/s: 9.8 | loss: 0.59987 | time elapsed: 00h25m46s | time left: 218h54m29s
epoch 0 | batch 1500 | examples/s: 9.8 | loss: 0.60035 | time elapsed: 00h26m11s | time left: 218h45m07s
epoch 0 | batch 1525 | examples/s: 9.9 | loss: 0.54744 | time elapsed: 00h26m37s | time left: 218h42m20s
epoch 0 | batch 1550 | examples/s: 10.1 | loss: 0.57234 | time elapsed: 00h27m03s | time left: 218h43m45s
epoch 0 | batch 1575 | examples/s: 9.8 | loss: 0.63796 | time elapsed: 00h27m29s | time left: 218h38m49s
epoch 0 | batch 1600 | examples/s: 9.9 | loss: 0.57050 | time elapsed: 00h27m56s | time left: 218h44m15s
epoch 0 | batch 1625 | examples/s: 9.7 | loss: 0.57792 | time elapsed: 00h28m22s | time left: 218h38m28s
epoch 0 | batch 1650 | examples/s: 9.8 | loss: 0.56862 | time elapsed: 00h28m48s | time left: 218h43m03s
epoch 0 | batch 1675 | examples/s: 9.7 | loss: 0.59931 | time elapsed: 00h29m15s | time left: 218h43m42s
epoch 0 | batch 1700 | examples/s: 9.9 | loss: 0.61965 | time elapsed: 00h29m42s | time left: 218h48m06s
epoch 0 | batch 1725 | examples/s: 10.0 | loss: 0.61029 | time elapsed: 00h30m08s | time left: 218h46m26s
epoch 0 | batch 1750 | examples/s: 9.7 | loss: 0.57544 | time elapsed: 00h30m34s | time left: 218h48m04s
epoch 0 | batch 1775 | examples/s: 9.9 | loss: 0.55535 | time elapsed: 00h31m00s | time left: 218h47m14s
epoch 0 | batch 1800 | examples/s: 9.9 | loss: 0.59221 | time elapsed: 00h31m28s | time left: 218h57m40s
epoch 0 | batch 1825 | examples/s: 9.8 | loss: 0.56692 | time elapsed: 00h31m55s | time left: 219h05m25s
epoch 0 | batch 1850 | examples/s: 9.7 | loss: 0.58602 | time elapsed: 00h32m22s | time left: 219h09m53s
epoch 0 | batch 1875 | examples/s: 10.0 | loss: 0.60834 | time elapsed: 00h32m49s | time left: 219h09m43s
epoch 0 | batch 1900 | examples/s: 10.1 | loss: 0.54720 | time elapsed: 00h33m15s | time left: 219h10m36s
epoch 0 | batch 1925 | examples/s: 9.7 | loss: 0.60159 | time elapsed: 00h33m42s | time left: 219h10m48s
epoch 0 | batch 1950 | examples/s: 9.8 | loss: 0.55886 | time elapsed: 00h34m08s | time left: 219h09m56s
epoch 0 | batch 1975 | examples/s: 9.7 | loss: 0.58505 | time elapsed: 00h34m34s | time left: 219h10m22s
epoch 0 | batch 2000 | examples/s: 9.8 | loss: 0.57191 | time elapsed: 00h35m01s | time left: 219h13m20s
epoch 0 | batch 4000 | examples/s: 9.8 | loss: 0.60531 | time elapsed: 01h06m13s | time left: 206h43m55s
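To turn this console output into the loss curve requested above, a minimal sketch would be the following; it assumes the log has been saved to a plain-text file, here called train_log.txt, which is just an illustrative name:

```python
import re
import matplotlib.pyplot as plt

# Parse "batch N | ... | loss: X" entries from the pasted console log and plot
# loss against batch index, so the oscillation is easier to judge at a glance.
pattern = re.compile(r"batch (\d+) \| examples/s: [\d.]+ \| loss: ([\d.]+)")
batches, losses = [], []
with open("train_log.txt") as f:
    for batch, loss in pattern.findall(f.read()):
        batches.append(int(batch))
        losses.append(float(loss))

plt.plot(batches, losses, marker="o", markersize=2)
plt.xlabel("batch")
plt.ylabel("training loss")
plt.title("epoch 0 loss")
plt.savefig("loss_curve.png", dpi=150)
```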
Thank you for reporting this issue. It was a stabilization issue in the distillation loss. I have fixed it and updated the code. You should be able to see some false-colored depth from the model even at the first epoch (just as a sanity check). It should work with batch size = 4, but I found the loss may fluctuate more with a lower batch size.
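For the sanity check mentioned above, a minimal way to false-color a predicted depth map is sketched below; the depth array is a random placeholder standing in for the network output, so this is not the repository's own visualization code:

```python
import numpy as np
import matplotlib.pyplot as plt

# Save a false-colored image of a 2-D depth map for a quick visual sanity check.
# `depth` is a placeholder here; in practice it would be the model's prediction
# (e.g. a 256x256 array within the 0-10 m range used in training).
depth = np.random.uniform(0.1, 10.0, size=(256, 256))

plt.imshow(depth, cmap="magma")   # any perceptual colormap works
plt.colorbar(label="depth [m]")
plt.axis("off")
plt.savefig("depth_epoch0_sanity.png", bbox_inches="tight")
```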
Thank you very much, the model has now converged.
Thanks again.
Batch size = 6, thanks.