WeiZongqi / CG-Net

Learning Calibrated-Guidance for Object Detection in Aerial Images
Apache License 2.0
58 stars 15 forks

Training Loss Error #1

Open j18567260 opened 3 years ago

j18567260 commented 3 years ago

During single-GPU training, the loss became abnormal at iteration 450/38298 of the first epoch, and roughly 400 batches later training crashed with a bbox error. Is there a problem with the dataset generation code?

2021-04-08 14:26:24,187 - INFO - Epoch [1][400/38298] lr: 0.00692, eta: 4 days, 5:44:23, time: 0.798, data_time: 0.030, memory: 4559, loss_rpn_cls: 0.3144, loss_rpn_bbox: 0.1515, s0.rbbox_loss_cls: 0.4531, s0.rbbox_acc: 92.7930, s0.rbbox_loss_bbox: 0.5924, s1.rbbox_loss_cls: 0.2729, s1.rbbox_acc: 95.7763, s1.rbbox_loss_bbox: 0.1261, loss: 1.9104
2021-04-08 14:27:03,769 - INFO - Epoch [1][450/38298] lr: 0.00746, eta: 4 days, 5:38:36, time: 0.792, data_time: 0.027, memory: 4559, loss_rpn_cls: 1.1329, loss_rpn_bbox: 3.9938, s0.rbbox_loss_cls: 0.3731, s0.rbbox_acc: 94.3412, s0.rbbox_loss_bbox: 0.3807, s1.rbbox_loss_cls: 0.3043, s1.rbbox_acc: 95.6289, s1.rbbox_loss_bbox: 0.0709, loss: 6.2557
2021-04-08 14:27:43,211 - INFO - Epoch [1][500/38298] lr: 0.00799, eta: 4 days, 5:31:41, time: 0.789, data_time: 0.030, memory: 4559, loss_rpn_cls: 4606.1320, loss_rpn_bbox: 411.3401, s0.rbbox_loss_cls: 3.2093, s0.rbbox_acc: 95.5460, s0.rbbox_loss_bbox: 1.5279, s1.rbbox_loss_cls: 3.1706, s1.rbbox_acc: 96.1591, s1.rbbox_loss_bbox: 0.0434, loss: 5025.4235
2021-04-08 14:28:28,953 - INFO - Epoch [1][550/38298] lr: 0.00800, eta: 4 days, 6:53:34, time: 0.915, data_time: 0.139, memory: 4804, loss_rpn_cls: 2756.4105, loss_rpn_bbox: 772.6908, s0.rbbox_loss_cls: 2.1745, s0.rbbox_acc: 93.6437, s0.rbbox_loss_bbox: 1.4731, s1.rbbox_loss_cls: 2.1121, s1.rbbox_acc: 93.7372, s1.rbbox_loss_bbox: 0.0369, loss: 3534.8980
2021-04-08 14:29:06,726 - INFO - Epoch [1][600/38298] lr: 0.00800, eta: 4 days, 6:20:04, time: 0.755, data_time: 0.020, memory: 4804, loss_rpn_cls: 22709.9212, loss_rpn_bbox: 2486.1367, s0.rbbox_loss_cls: 12.9946, s0.rbbox_acc: 94.3856, s0.rbbox_loss_bbox: 5.9148, s1.rbbox_loss_cls: 17.1465, s1.rbbox_acc: 94.4870, s1.rbbox_loss_bbox: 0.2170, loss: 25232.3292
2021-04-08 14:29:44,850 - INFO - Epoch [1][650/38298] lr: 0.00800, eta: 4 days, 5:55:45, time: 0.762, data_time: 0.033, memory: 4804, loss_rpn_cls: 112816.2896, loss_rpn_bbox: 60489.4470, s0.rbbox_loss_cls: 92.8384, s0.rbbox_acc: 83.4779, s0.rbbox_loss_bbox: 22.8278, s1.rbbox_loss_cls: 109.9155, s1.rbbox_acc: 83.5014, s1.rbbox_loss_bbox: 1.9195, loss: 173533.2350
2021-04-08 14:30:21,997 - INFO - Epoch [1][700/38298] lr: 0.00800, eta: 4 days, 5:24:08, time: 0.743, data_time: 0.020, memory: 4804, loss_rpn_cls: 17055451.9681, loss_rpn_bbox: 28939940.7239, s0.rbbox_loss_cls: 1998.5494, s0.rbbox_acc: 85.2132, s0.rbbox_loss_bbox: 360.3875, s1.rbbox_loss_cls: 2116.1039, s1.rbbox_acc: 85.2131, s1.rbbox_loss_bbox: 42.7391, loss: 45999909.1724
2021-04-08 14:30:58,686 - INFO - Epoch [1][750/38298] lr: 0.00800, eta: 4 days, 4:52:00, time: 0.734, data_time: 0.020, memory: 4804, loss_rpn_cls: 610655327.4717, loss_rpn_bbox: 877621758.9251, s0.rbbox_loss_cls: 4449.5225, s0.rbbox_acc: 84.1080, s0.rbbox_loss_bbox: 1000.2603, s1.rbbox_loss_cls: 4128.5755, s1.rbbox_acc: 81.5336, s1.rbbox_loss_bbox: 79.6218, loss: 1488286705.3515
2021-04-08 14:31:36,322 - INFO - Epoch [1][800/38298] lr: 0.00800, eta: 4 days, 4:32:51, time: 0.753, data_time: 0.031, memory: 4804, loss_rpn_cls: 3780733863668.9443, loss_rpn_bbox: 4202940688343.9102, s0.rbbox_loss_cls: 191388.0878, s0.rbbox_acc: 83.0183, s0.rbbox_loss_bbox: 89285.3539, s1.rbbox_loss_cls: 480514.5821, s1.rbbox_acc: 83.1815, s1.rbbox_loss_bbox: 2610.6784, loss: 7983675289758.8213

Traceback (most recent call last):
  File "tools/train.py", line 97, in <module>
    main()
  File "tools/train.py", line 93, in main
    logger=logger)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 61, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 219, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 384, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 283, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 39, in batch_processor
    losses = model(**data)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/models/detectors/base_new.py", line 95, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/models/detectors/RoITransformer.py", line 223, in forward_train
    gt_labels[i])
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/core/bbox/assigners/max_iou_assigner_rbbox.py", line 73, in assign
    raise ValueError('No gt or bboxes')
ValueError: No gt or bboxes
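For a long log like the one above, it can help to locate the first iteration where the total loss jumps rather than reading the entries by eye. The following is only a minimal sketch: the helper name `first_divergence`, the jump threshold `factor=3.0`, and the shortened sample entries (values copied from the log, most fields dropped) are illustrative assumptions, not part of CG-Net or mmdetection.

```python
import math
import re

# Matches the epoch, the iteration, and the trailing total "loss:" value of one
# log entry. ", loss: " does not match per-branch names like "loss_rpn_cls:".
ENTRY = re.compile(
    r"Epoch \[(\d+)\]\[(\d+)/\d+\].*?, loss: ([-+0-9.eE]+|nan|inf)"
)


def first_divergence(log_text, factor=3.0):
    """Return (epoch, iteration, loss) for the first entry whose total loss is
    non-finite or more than `factor` times the previous entry's loss.
    Returns None if no such entry exists. `factor` is an arbitrary threshold."""
    previous = None
    for match in ENTRY.finditer(log_text):
        epoch = int(match.group(1))
        iteration = int(match.group(2))
        loss = float(match.group(3))
        if not math.isfinite(loss):
            return epoch, iteration, loss
        if previous is not None and loss > factor * previous:
            return epoch, iteration, loss
        previous = loss
    return None


# Shortened entries; the numbers are taken from the log in this issue.
SAMPLE_LOG = """\
2021-04-08 14:26:24,187 - INFO - Epoch [1][400/38298] lr: 0.00692, loss_rpn_cls: 0.3144, loss_rpn_bbox: 0.1515, loss: 1.9104
2021-04-08 14:27:03,769 - INFO - Epoch [1][450/38298] lr: 0.00746, loss_rpn_cls: 1.1329, loss_rpn_bbox: 3.9938, loss: 6.2557
2021-04-08 14:27:43,211 - INFO - Epoch [1][500/38298] lr: 0.00799, loss_rpn_cls: 4606.1320, loss_rpn_bbox: 411.3401, loss: 5025.4235
"""

if __name__ == "__main__":
    print(first_divergence(SAMPLE_LOG))
```

With this threshold the sample is flagged at iteration 450, where `loss_rpn_bbox` rises from 0.1515 to 3.9938 while the learning rate is still warming up. The `ValueError: No gt or bboxes` only appears several hundred iterations later, so it is probably a downstream symptom of the exploding loss rather than the original fault.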

WeiZongqi commented 3 years ago

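The data preparation scripts are fine. Judging from the loss, this looks like a gradient explosion. My environment is two 1080 Ti GPUs. I suggest trying a larger batch size or a lower learning rate.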
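One common way to choose the lower learning rate is the linear scaling heuristic, which keeps the ratio of learning rate to total batch size constant. The sketch below assumes that the peak rate of 0.008 seen in the log was tuned for the two-GPU setup and that the images-per-GPU setting is unchanged. The helper `scale_lr` and the value `IMGS_PER_GPU = 2` are hypothetical, and this is not a recipe confirmed by the maintainer.

```python
def scale_lr(reference_lr, reference_batch, new_batch):
    """Linear scaling heuristic: keep lr / total_batch_size constant."""
    if reference_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return reference_lr * new_batch / reference_batch


# Assumptions (not confirmed in the thread):
#   - 0.008 is the post-warmup learning rate, as printed in the log
#   - the reference setup is 2 GPUs, the reporter's setup is 1 GPU
#   - IMGS_PER_GPU is a hypothetical value, identical in both setups
IMGS_PER_GPU = 2
REFERENCE_LR = 0.008

single_gpu_lr = scale_lr(REFERENCE_LR, 2 * IMGS_PER_GPU, 1 * IMGS_PER_GPU)

if __name__ == "__main__":
    print(f"suggested single-GPU lr: {single_gpu_lr:.4f}")
```

Under these assumptions the suggestion is to halve the rate for single-GPU training. If the loss still diverges during warmup, lowering the rate further is a reasonable next step.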

kongyan66 commented 3 years ago

@j18567260 Hi, did you run into this problem during training? TypeError: logger must be a logging.Logger object, but got <class 'str'>

hengseuer commented 3 years ago

> @j18567260 Hi, did you run into this problem during training? TypeError: logger must be a logging.Logger object, but got <class 'str'>

use mmcv==0.4.0

kongyan66 commented 3 years ago

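@ahaheng @WeiZongqi @j18567260 The environment issue is resolved. I am using the DOTA dataset converted to COCO format, but training fails with this error: [image] Has anyone run into this?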