Open zhaoyangwei123 opened 1 year ago
Hello @zhaoyangwei123, the large loss is normal for box2mask. I have uploaded my training log file for COCO (R-101); you can refer to it.
I didn't encounter the above problem. It seems to be a problem with the assigner. Are you training on COCO or on your own dataset? I have tested the code and configs, and they work normally for COCO and VOC.
@LiWentomng I am training on COCO with 8 NVIDIA RTX 2080 Ti GPUs, so I changed the image size from (1024, 1024) to (800, 800) with batch=1 and num_workers=0. I don't know whether the problem comes from changing these parameters.
@zhaoyangwei123 I suggest you first try VOC on the RTX 2080 Ti GPUs. VOC needs less GPU memory and less training time. The VOC link with COCO-format annotations is here.
I guess that batch_size=1 may cause this problem. I will check it.
@zhaoyangwei123
I have fixed this issue. When batch_size=1, the loss values became nan.
You can try the current code. Please note that when batch_size=1, the learning rate lr, the training step, and max_iters (50e by default) need to be changed proportionally.
Any further questions can be discussed.
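As a rough illustration of changing these values proportionally, here is a minimal sketch that assumes the common linear scaling rule (lr scales with the total batch size, iteration counts scale inversely so the number of images seen stays the same). The function name and the reference values are hypothetical and are not taken from the repository's configs.

```python
def rescale_schedule(base_lr, base_batch, base_max_iters, base_steps, new_batch):
    """Rescale an iteration-based schedule for a different total batch size.

    Assumption: linear scaling rule. The learning rate scales with the batch
    size, and iteration counts scale inversely, so the total number of images
    processed during training is unchanged.
    """
    ratio = new_batch / base_batch
    return {
        "lr": base_lr * ratio,
        "max_iters": round(base_max_iters / ratio),
        "step": [round(s / ratio) for s in base_steps],
    }


# Hypothetical reference schedule (placeholder numbers, not the repo defaults).
base = dict(base_lr=1e-4, base_batch=16, base_max_iters=100_000,
            base_steps=[80_000, 95_000])

# Example: 8 GPUs x 1 image per GPU gives a total batch size of 8.
scaled = rescale_schedule(new_batch=8, **base)
print(scaled)
```

Halving the total batch size in this sketch halves lr and doubles both max_iters and the step milestones; the actual reference values should be read from the config being used.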
@LiWentomng Thank you very much for your answer, but when I run your new code, I have the following problem:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/transforms.py", line 767, in __init__
assert crop_size[0] > 0 and crop_size[1] > 0
TypeError: '>' not supported between instances of 'tuple' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/custom.py", line 129, in __init__
self.pipeline = Compose(pipeline)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/compose.py", line 23, in __init__
transform = build_from_cfg(transform, PIPELINES)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: RandomCrop: '>' not supported between instances of 'tuple' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/train.py", line 242, in <module>
I verified boxlevelset and boxinst, both work fine, so I think there may be some errors in the box2mask code you uploaded.
@zhaoyangwei123
When did this error appear: at the start, or during the training process?
I have tested the code and config with 800x800 and bs=1, and training works fine.
According to the reported error, is the image size in your config in the right format, i.e. image_size = (800, 800)?
Can you share your config information?
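To make the format question concrete: the assertion in the traceback compares crop_size[0] with an int, so it fails whenever crop_size[0] is itself a tuple. The sketch below shows one way that can happen (a stray trailing comma that wraps the pair in another tuple). This is only an illustration, not a diagnosis of the config in this thread, and check_crop_size is a hypothetical helper that is not part of mmdet.

```python
def check_crop_size(crop_size):
    """Hypothetical pre-check: accept only a flat pair of positive ints."""
    if not (isinstance(crop_size, (tuple, list)) and len(crop_size) == 2):
        raise ValueError(f"expected (h, w), got {crop_size!r}")
    if not all(isinstance(v, int) and v > 0 for v in crop_size):
        raise ValueError(f"expected two positive ints, got {crop_size!r}")
    return tuple(crop_size)


good = (800, 800)
nested = (800, 800),  # trailing comma: this is ((800, 800),), a tuple of tuples

check_crop_size(good)  # passes silently

try:
    # Same kind of comparison as the failing assertion in the traceback.
    assert nested[0] > 0
except TypeError as err:
    message = str(err)
    print(message)

try:
    check_crop_size(nested)
except ValueError as err:
    print(err)  # the pre-check reports the malformed value directly
```

Printing the resolved value of image_size (and the crop_size passed to RandomCrop) right before training starts is a quick way to confirm whether it is a flat pair.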
@LiWentomng Hello, my error came at the beginning of training, and I have the following config: image_size = (1024, 1024), samples_per_gpu=1, workers_per_gpu=0, lr=0.00005. The rest of the configuration is unchanged. Because errors were reported on multiple GPUs, I decided to solve the problem on a single GPU first. On a single 2080 Ti, the image size can be left unchanged. I located the error at line 767 of transforms.py.
Hello @LiWentomng, I tried to reproduce your box2mask paper, but I ran into the following problems, and the model had a very large loss at the beginning of training. How can I solve this?
2023-01-07 12:27:40,129 - mmdet - INFO - Iter [50/368750] lr: 5.000e-06, eta: 3 days, 14:44:25, time: 0.847, data_time: 0.050, memory: 6779, loss_cls: 9.3236, loss_project: 6.2381, loss_levelset: 0.0710, d0.loss_cls: 9.0557, d0.loss_project: 5.5436, d0.loss_levelset: 0.0670, d1.loss_cls: 9.3925, d1.loss_project: 5.5199, d1.loss_levelset: 0.0640, d2.loss_cls: 9.1847, d2.loss_project: 5.7577, d2.loss_levelset: 0.0549, d3.loss_cls: 9.3142, d3.loss_project: 5.8749, d3.loss_levelset: 0.0656, d4.loss_cls: 9.4000, d4.loss_project: 5.8713, d4.loss_levelset: 0.0596, d5.loss_cls: 9.0998, d5.loss_project: 6.2049, d5.loss_levelset: 0.0682, d6.loss_cls: 9.1544, d6.loss_project: 6.1733, d6.loss_levelset: 0.0779, d7.loss_cls: 9.0938, d7.loss_project: 6.3329, d7.loss_levelset: 0.0836, d8.loss_cls: 8.7211, d8.loss_project: 6.4827, d8.loss_levelset: 0.0856, loss: 152.4366, grad_norm: 307.3523
Traceback (most recent call last):
File "./tools/train.py", line 242, in <module>
main()
File "./tools/train.py", line 231, in main
train_detector(
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/apis/train.py", line 244, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/maskformer.py", line 104, in forward_train
losses = self.panoptic_head.forward_train(x, img_metas, gt_bboxes,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 440, in forward_train
losses = self.loss(all_cls_scores, all_mask_preds, all_lst_feats,gt_labels, gt_masks,
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
return old_func(*args, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 203, in loss
losses_cls, loss_project, loss_levelset = multi_apply(
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 239, in loss_single
num_total_pos,num_total_neg) = self.get_targets(cls_scores_list, mask_preds_list,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 142, in get_targets
neg_inds_list) = multi_apply(self._get_target_single, cls_scores_list,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 167, in _get_target_single
assign_result = self.assigner.assign(cls_score, mask_pred,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/bbox/assigners/mask_hungarian_assigner.py", line 119, in assign
matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
ValueError: matrix contains invalid numeric entries
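The final ValueError comes from linear_sum_assignment, which, as far as I know, rejects a cost matrix containing NaN entries; that would be consistent with the nan loss values mentioned elsewhere in this thread. A minimal sketch of a pre-check that could be run on the cost matrix before the assignment call follows. It uses plain Python lists to stay self-contained (in the repository the cost would be a tensor), and find_non_finite is a hypothetical helper, not part of mmdet or SciPy.

```python
import math


def find_non_finite(cost):
    """Return (row, col) indices of NaN or infinite entries in a 2-D cost list."""
    return [
        (r, c)
        for r, row in enumerate(cost)
        for c, value in enumerate(row)
        if not math.isfinite(value)
    ]


# Toy cost matrix with one NaN entry, standing in for the assigner's cost.
cost = [[0.2, float("nan")],
        [1.5, 0.7]]

bad = find_non_finite(cost)
if bad:
    print(f"cost matrix has non-finite entries at {bad}")
```

Logging which entries are non-finite, and which of the cost terms produced them, can help narrow down where the nan first appears.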