TRAILab / CaDDN

Categorical Depth Distribution Network for Monocular 3D Object Detection (CVPR 2021 Oral)
Apache License 2.0

When training, the loss and depth loss do not converge #26

Closed Gaozihui closed 3 years ago

Gaozihui commented 3 years ago

Thanks for your brilliant work!

However, when I print the loss and depth_loss during training, they show no tendency to converge. I would like to know what is wrong with my setup.

Thanks in advance for any help you can offer.

train: 55%|█████▌ | 1022/1856 [18:19<14:59, 1.08s/it, total_it=1021] epochs: 0%| | 0/80 [18:19<?, ?it/s, loss=10.9, lr=0.000101] loss is tensor(19.3046, device='cuda:0') loss_depth is 16.68878936767578

train: 55%|█████▌ | 1023/1856 [18:20<14:59, 1.08s/it, total_it=1022] epochs: 0%| | 0/80 [18:20<?, ?it/s, loss=19.3, lr=0.000101] loss is tensor(8.4508, device='cuda:0') loss_depth is 6.085319995880127

train: 55%|█████▌ | 1024/1856 [18:21<14:58, 1.08s/it, total_it=1023] epochs: 0%| | 0/80 [18:21<?, ?it/s, loss=8.45, lr=0.000101] loss is tensor(7.5888, device='cuda:0') loss_depth is 4.7100067138671875

train: 55%|█████▌ | 1025/1856 [18:22<14:56, 1.08s/it, total_it=1024] epochs: 0%| | 0/80 [18:22<?, ?it/s, loss=7.59, lr=0.000101] loss is tensor(7.5831, device='cuda:0') loss_depth is 5.188798904418945

train: 55%|█████▌ | 1026/1856 [18:23<14:56, 1.08s/it, total_it=1025] epochs: 0%| | 0/80 [18:23<?, ?it/s, loss=7.58, lr=0.000101] loss is tensor(11.0090, device='cuda:0') loss_depth is 8.783016204833984

train: 55%|█████▌ | 1027/1856 [18:24<14:55, 1.08s/it, total_it=1026] epochs: 0%| | 0/80 [18:24<?, ?it/s, loss=11, lr=0.000101] loss is tensor(9.2116, device='cuda:0') loss_depth is 6.269783020019531

train: 55%|█████▌ | 1028/1856 [18:25<14:55, 1.08s/it, total_it=1027] epochs: 0%| | 0/80 [18:26<?, ?it/s, loss=9.21, lr=0.000101] loss is tensor(7.4908, device='cuda:0') loss_depth is 4.79773473739624

train: 55%|█████▌ | 1029/1856 [18:26<14:52, 1.08s/it, total_it=1028] epochs: 0%| | 0/80 [18:27<?, ?it/s, loss=7.49, lr=0.000101] loss is tensor(17.2144, device='cuda:0') loss_depth is 14.055488586425781

train: 55%|█████▌ | 1030/1856 [18:27<14:52, 1.08s/it, total_it=1029] epochs: 0%| | 0/80 [18:28<?, ?it/s, loss=17.2, lr=0.000101] loss is tensor(7.1547, device='cuda:0') loss_depth is 4.631966590881348

train: 56%|█████▌ | 1031/1856 [18:28<14:50, 1.08s/it, total_it=1030] epochs: 0%| | 0/80 [18:29<?, ?it/s, loss=7.15, lr=0.000101] loss is tensor(14.0801, device='cuda:0') loss_depth is 10.272579193115234

train: 56%|█████▌ | 1032/1856 [18:29<14:50, 1.08s/it, total_it=1031] epochs: 0%| | 0/80 [18:30<?, ?it/s, loss=14.1, lr=0.000101] loss is tensor(8.0110, device='cuda:0') loss_depth is 4.475630283355713

train: 56%|█████▌ | 1033/1856 [18:30<14:48, 1.08s/it, total_it=1032] epochs: 0%| | 0/80 [18:31<?, ?it/s, loss=8.01, lr=0.000101] loss is tensor(10.5358, device='cuda:0') loss_depth is 7.642397880554199

train: 56%|█████▌ | 1034/1856 [18:32<14:49, 1.08s/it, total_it=1033] epochs: 0%| | 0/80 [18:32<?, ?it/s, loss=10.5, lr=0.000101] loss is tensor(10.3984, device='cuda:0') loss_depth is 7.659360408782959

train: 56%|█████▌ | 1035/1856 [18:33<14:48, 1.08s/it, total_it=1034] epochs: 0%| | 0/80 [18:33<?, ?it/s, loss=10.4, lr=0.000101] loss is tensor(7.7434, device='cuda:0') loss_depth is 4.804980278015137

train: 56%|█████▌ | 1036/1856 [18:34<14:46, 1.08s/it, total_it=1035] epochs: 0%| | 0/80 [18:34<?, ?it/s, loss=7.74, lr=0.000101] loss is tensor(11.9586, device='cuda:0') loss_depth is 9.924271583557129

codyreading commented 3 years ago

Hi Zihui!

These are loss values from just 15 iterations, which is too short a window to judge convergence. I recommend plotting the loss over a longer duration with TensorBoard and looking at the trend. The training script should already write a .tfevents file you can use for plotting.

Additionally, since we are using a small batch size (I assume a batch size of 2 here), the loss values will vary quite a bit between iterations. Over a longer duration, however, the trend should be downward.
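
For reference, here is a minimal sketch of how the logged scalars could be read and smoothed from the .tfevents file. The log directory path and the scalar tag name (`train/loss`) are assumptions for illustration; substitute whatever your run actually writes. Alternatively, you can simply point `tensorboard --logdir <your output dir>` at the run directory and view the curves interactively.

```python
# Minimal sketch: plot a smoothed training-loss curve from a .tfevents file.
# Assumptions: the log directory path and the scalar tag name ('train/loss')
# are placeholders -- check what your training run actually writes.
import numpy as np
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "output/kitti_models/CaDDN/default/tensorboard"  # hypothetical path
tag = "train/loss"                                          # hypothetical tag name

# Load all scalar events written to log_dir.
ea = EventAccumulator(log_dir)
ea.Reload()

events = ea.Scalars(tag)
steps = np.array([e.step for e in events])
values = np.array([e.value for e in events])

# Smooth with a simple moving average so the per-iteration noise
# (expected with a batch size of 2) does not hide the overall trend.
window = 200
kernel = np.ones(window) / window
smoothed = np.convolve(values, kernel, mode="valid")

plt.plot(steps, values, alpha=0.3, label="raw loss")
plt.plot(steps[window - 1:], smoothed, label=f"moving average ({window} iters)")
plt.xlabel("iteration")
plt.ylabel(tag)
plt.legend()
plt.show()
```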

codyreading commented 3 years ago

Closing due to inactivity