Training Questions

Hi, thanks for your great work! I'm trying out a LiDAR-only model but have run into a strange problem: when I train it, the first 20-30 iterations finish quickly, but then training nearly stalls. In the log excerpt below, iterations 23 and 24 take about 0.6 s each, while iteration 25 takes about 279 s and iteration 26 about 69 s.

2022-07-11 14:49:09,578 - mmdet - INFO - Epoch [1][23/15435]    lr: 1.000e-04, eta: 3 days, 8:59:41, time: 0.571, data_time: 0.006, memory: 3081, loss_heatmap: 63.3409, layer_-1_loss_cls: 5.5666, layer_-1_loss_bbox: 7.2338, matched_ious: 0.0035, loss: 76.1413, grad_norm: 650.1203
2022-07-11 14:49:10,203 - mmdet - INFO - Epoch [1][24/15435]    lr: 1.000e-04, eta: 3 days, 7:51:04, time: 0.625, data_time: 0.007, memory: 3081, loss_heatmap: 40.5755, layer_-1_loss_cls: 3.8930, layer_-1_loss_bbox: 8.2220, matched_ious: 0.0013, loss: 52.6905, grad_norm: 427.7450
2022-07-11 14:53:49,345 - mmdet - INFO - Epoch [1][25/15435]    lr: 1.000e-04, eta: 43 days, 2:02:12, time: 279.142, data_time: 278.522, memory: 3081, loss_heatmap: 63.4586, layer_-1_loss_cls: 7.2327, layer_-1_loss_bbox: 8.9833, matched_ious: 0.0044, loss: 79.6746, grad_norm: 703.8561
2022-07-11 14:54:58,492 - mmdet - INFO - Epoch [1][26/15435]    lr: 1.000e-04, eta: 50 days, 22:17:40, time: 69.147, data_time: 68.535, memory: 3081, loss_heatmap: 61.2396, layer_-1_loss_cls: 7.7423, layer_-1_loss_bbox: 7.6271, matched_ious: 0.0023, loss: 76.6090, grad_norm: 649.6351

From the jump in eta (from about 3 days to over 50 days), you can imagine how long a single iteration takes :) Nearly all of that time is data_time. The code is running on a V100 machine with PyTorch 1.9. Have you ever run into this problem?
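A minimal sketch for checking this across a longer log, assuming the mmdet text-log line format shown above. It pulls `time` and `data_time` out of each iteration line and flags iterations where data loading dominates the step. As far as I understand mmcv's timer hook, `data_time` is the time spent waiting for the next batch, so a ratio near 1 points at the data pipeline rather than the GPU. The regex, the helper names `parse_iter_times` / `loader_bound_iters`, and the 0.5 cutoff are hypothetical, not part of mmdet or TransFusion.

```python
import re

# Matches the per-iteration fields in an mmdet text-log line, e.g.
# "Epoch [1][25/15435] ... time: 279.142, data_time: 278.522, ..."
# `\btime` keeps the pattern from matching the "time" inside "data_time".
LOG_RE = re.compile(
    r"Epoch \[(\d+)\]\[(\d+)/\d+\].*?\btime: ([\d.]+), data_time: ([\d.]+)"
)

# Excerpt of the log above, truncated after the memory field.
SAMPLE_LOG = """\
2022-07-11 14:49:09,578 - mmdet - INFO - Epoch [1][23/15435]    lr: 1.000e-04, eta: 3 days, 8:59:41, time: 0.571, data_time: 0.006, memory: 3081
2022-07-11 14:49:10,203 - mmdet - INFO - Epoch [1][24/15435]    lr: 1.000e-04, eta: 3 days, 7:51:04, time: 0.625, data_time: 0.007, memory: 3081
2022-07-11 14:53:49,345 - mmdet - INFO - Epoch [1][25/15435]    lr: 1.000e-04, eta: 43 days, 2:02:12, time: 279.142, data_time: 278.522, memory: 3081
2022-07-11 14:54:58,492 - mmdet - INFO - Epoch [1][26/15435]    lr: 1.000e-04, eta: 50 days, 22:17:40, time: 69.147, data_time: 68.535, memory: 3081"""


def parse_iter_times(lines):
    """Return (epoch, iter, time, data_time) for every matching log line."""
    records = []
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            epoch, it = int(m.group(1)), int(m.group(2))
            records.append((epoch, it, float(m.group(3)), float(m.group(4))))
    return records


def loader_bound_iters(records, ratio=0.5):
    """Iterations whose data_time exceeds `ratio` of the total step time.

    The 0.5 default is an arbitrary, hypothetical cutoff.
    """
    return [it for _, it, t, dt in records if t > 0 and dt / t > ratio]


if __name__ == "__main__":
    recs = parse_iter_times(SAMPLE_LOG.splitlines())
    for epoch, it, t, dt in recs:
        print(f"epoch {epoch} iter {it}: {t:.3f} s total, {dt:.3f} s waiting for data")
    print("loader-bound iterations:", loader_bound_iters(recs))
```

On the excerpt, iterations 25 and 26 are flagged, since over 99% of each step is spent in `data_time`.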

XuyangBai / TransFusion

Training Questions #40