TRAILab / CaDDN

Categorical Depth Distribution Network for Monocular 3D Object Detection (CVPR 2021 Oral)
Apache License 2.0

When training, the loss and depth loss do not converge #26

Closed Gaozihui closed 3 years ago

Gaozihui commented 3 years ago

Thanks for your brilliant work!

However, when I print the loss and depth_loss during training, they show no tendency to converge. I would like to know what is wrong with my setup.

Thanks in advance for any help you can offer.

train: 55%|█████▌ | 1022/1856 [18:19<14:59, 1.08s/it, total_it=1021] epochs: 0%| | 0/80 [18:19<?, ?it/s, loss=10.9, lr=0.000101] loss is tensor(19.3046, device='cuda:0') loss_depth is 16.68878936767578

train: 55%|█████▌ | 1023/1856 [18:20<14:59, 1.08s/it, total_it=1022] epochs: 0%| | 0/80 [18:20<?, ?it/s, loss=19.3, lr=0.000101] loss is tensor(8.4508, device='cuda:0') loss_depth is 6.085319995880127

train: 55%|█████▌ | 1024/1856 [18:21<14:58, 1.08s/it, total_it=1023] epochs: 0%| | 0/80 [18:21<?, ?it/s, loss=8.45, lr=0.000101] loss is tensor(7.5888, device='cuda:0') loss_depth is 4.7100067138671875

train: 55%|█████▌ | 1025/1856 [18:22<14:56, 1.08s/it, total_it=1024] epochs: 0%| | 0/80 [18:22<?, ?it/s, loss=7.59, lr=0.000101] loss is tensor(7.5831, device='cuda:0') loss_depth is 5.188798904418945

train: 55%|█████▌ | 1026/1856 [18:23<14:56, 1.08s/it, total_it=1025] epochs: 0%| | 0/80 [18:23<?, ?it/s, loss=7.58, lr=0.000101] loss is tensor(11.0090, device='cuda:0') loss_depth is 8.783016204833984

train: 55%|█████▌ | 1027/1856 [18:24<14:55, 1.08s/it, total_it=1026] epochs: 0%| | 0/80 [18:24<?, ?it/s, loss=11, lr=0.000101] loss is tensor(9.2116, device='cuda:0') loss_depth is 6.269783020019531

train: 55%|█████▌ | 1028/1856 [18:25<14:55, 1.08s/it, total_it=1027] epochs: 0%| | 0/80 [18:26<?, ?it/s, loss=9.21, lr=0.000101] loss is tensor(7.4908, device='cuda:0') loss_depth is 4.79773473739624

train: 55%|█████▌ | 1029/1856 [18:26<14:52, 1.08s/it, total_it=1028] epochs: 0%| | 0/80 [18:27<?, ?it/s, loss=7.49, lr=0.000101] loss is tensor(17.2144, device='cuda:0') loss_depth is 14.055488586425781

train: 55%|█████▌ | 1030/1856 [18:27<14:52, 1.08s/it, total_it=1029] epochs: 0%| | 0/80 [18:28<?, ?it/s, loss=17.2, lr=0.000101] loss is tensor(7.1547, device='cuda:0') loss_depth is 4.631966590881348

train: 56%|█████▌ | 1031/1856 [18:28<14:50, 1.08s/it, total_it=1030] epochs: 0%| | 0/80 [18:29<?, ?it/s, loss=7.15, lr=0.000101] loss is tensor(14.0801, device='cuda:0') loss_depth is 10.272579193115234

train: 56%|█████▌ | 1032/1856 [18:29<14:50, 1.08s/it, total_it=1031] epochs: 0%| | 0/80 [18:30<?, ?it/s, loss=14.1, lr=0.000101] loss is tensor(8.0110, device='cuda:0') loss_depth is 4.475630283355713

train: 56%|█████▌ | 1033/1856 [18:30<14:48, 1.08s/it, total_it=1032] epochs: 0%| | 0/80 [18:31<?, ?it/s, loss=8.01, lr=0.000101] loss is tensor(10.5358, device='cuda:0') loss_depth is 7.642397880554199

train: 56%|█████▌ | 1034/1856 [18:32<14:49, 1.08s/it, total_it=1033] epochs: 0%| | 0/80 [18:32<?, ?it/s, loss=10.5, lr=0.000101] loss is tensor(10.3984, device='cuda:0') loss_depth is 7.659360408782959

train: 56%|█████▌ | 1035/1856 [18:33<14:48, 1.08s/it, total_it=1034] epochs: 0%| | 0/80 [18:33<?, ?it/s, loss=10.4, lr=0.000101] loss is tensor(7.7434, device='cuda:0') loss_depth is 4.804980278015137

train: 56%|█████▌ | 1036/1856 [18:34<14:46, 1.08s/it, total_it=1035] epochs: 0%| | 0/80 [18:34<?, ?it/s, loss=7.74, lr=0.000101] loss is tensor(11.9586, device='cuda:0') loss_depth is 9.924271583557129

codyreading commented 3 years ago

Hi Zihui!

These are loss values from just 15 iterations, which is too short a window to judge convergence. I recommend plotting the loss over a longer duration with TensorBoard and looking at the trend. The training script should already write a .tfevents file you can use for plotting.

Additionally, since we are using a small batch size (I assume a batch size of 2 here), the loss values will vary quite a bit between iterations. Over a longer duration, however, the trend should be downward.
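
For reference, here is a minimal sketch of how the logged scalars could be read and smoothed from the .tfevents file. The log directory path and the scalar tag name (`train/loss`) are assumptions for illustration; substitute whatever your run actually writes. Alternatively, you can simply point `tensorboard --logdir <your output dir>` at the run directory and view the curves interactively.

```python
# Minimal sketch: plot a smoothed training-loss curve from a .tfevents file.
# Assumptions: the log directory path and the scalar tag name ('train/loss')
# are placeholders -- check what your training run actually writes.
import numpy as np
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "output/kitti_models/CaDDN/default/tensorboard"  # hypothetical path
tag = "train/loss"                                          # hypothetical tag name

# Load all scalar events written to log_dir.
ea = EventAccumulator(log_dir)
ea.Reload()

events = ea.Scalars(tag)
steps = np.array([e.step for e in events])
values = np.array([e.value for e in events])

# Smooth with a simple moving average so the per-iteration noise
# (expected with a batch size of 2) does not hide the overall trend.
window = 200
kernel = np.ones(window) / window
smoothed = np.convolve(values, kernel, mode="valid")

plt.plot(steps, values, alpha=0.3, label="raw loss")
plt.plot(steps[window - 1:], smoothed, label=f"moving average ({window} iters)")
plt.xlabel("iteration")
plt.ylabel(tag)
plt.legend()
plt.show()
```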

codyreading commented 3 years ago

Closing due to inactivity