LiWentomng / BoxInstSeg

A toolbox for box-supervised instance segmentation.
Apache License 2.0

ValueError: matrix contains invalid numeric entries #6

Open zhaoyangwei123 opened 1 year ago

zhaoyangwei123 commented 1 year ago

Hello, @LiWentomng. I tried to reproduce your paper box2mask, but I ran into the following problem, and the model has a very large loss at the beginning of training. How can I solve it?

2023-01-07 12:27:40,129 - mmdet - INFO - Iter [50/368750] lr: 5.000e-06, eta: 3 days, 14:44:25, time: 0.847, data_time: 0.050, memory: 6779, loss_cls: 9.3236, loss_project: 6.2381, loss_levelset: 0.0710, d0.loss_cls: 9.0557, d0.loss_project: 5.5436, d0.loss_levelset: 0.0670, d1.loss_cls: 9.3925, d1.loss_project: 5.5199, d1.loss_levelset: 0.0640, d2.loss_cls: 9.1847, d2.loss_project: 5.7577, d2.loss_levelset: 0.0549, d3.loss_cls: 9.3142, d3.loss_project: 5.8749, d3.loss_levelset: 0.0656, d4.loss_cls: 9.4000, d4.loss_project: 5.8713, d4.loss_levelset: 0.0596, d5.loss_cls: 9.0998, d5.loss_project: 6.2049, d5.loss_levelset: 0.0682, d6.loss_cls: 9.1544, d6.loss_project: 6.1733, d6.loss_levelset: 0.0779, d7.loss_cls: 9.0938, d7.loss_project: 6.3329, d7.loss_levelset: 0.0836, d8.loss_cls: 8.7211, d8.loss_project: 6.4827, d8.loss_levelset: 0.0856, loss: 152.4366, grad_norm: 307.3523

Traceback (most recent call last):
  File "./tools/train.py", line 242, in <module>
    main()
  File "./tools/train.py", line 231, in main
    train_detector(
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/apis/train.py", line 244, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/maskformer.py", line 104, in forward_train
    losses = self.panoptic_head.forward_train(x, img_metas, gt_bboxes,
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 440, in forward_train
    losses = self.loss(all_cls_scores, all_mask_preds, all_lst_feats, gt_labels, gt_masks,
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
    return old_func(*args, **kwargs)
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 203, in loss
    losses_cls, loss_project, loss_levelset = multi_apply(
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 239, in loss_single
    num_total_pos, num_total_neg) = self.get_targets(cls_scores_list, mask_preds_list,
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 142, in get_targets
    neg_inds_list) = multi_apply(self._get_target_single, cls_scores_list,
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 167, in _get_target_single
    assign_result = self.assigner.assign(cls_score, mask_pred,
  File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/bbox/assigners/mask_hungarian_assigner.py", line 119, in assign
    matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
ValueError: matrix contains invalid numeric entries
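The ValueError above is raised inside SciPy's linear_sum_assignment, which rejects cost matrices containing NaN or Inf. A minimal reproduction, plus an illustrative guard (the nan_to_num sanitization here is only a diagnostic workaround, not the fix later applied in BoxInstSeg, which addresses where the NaNs come from):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# A cost matrix with a single NaN entry reproduces the reported error.
cost = np.array([[0.5, np.nan],
                 [0.2, 0.7]])
try:
    linear_sum_assignment(cost)
except ValueError:
    pass  # "matrix contains invalid numeric entries"

# Illustrative guard: replace non-finite entries with a large finite cost
# so the assignment can still run. The real fix is to prevent NaNs from
# appearing in the loss/cost computation in the first place.
safe_cost = np.nan_to_num(cost, nan=1e8, posinf=1e8, neginf=-1e8)
rows, cols = linear_sum_assignment(safe_cost)
print(rows.tolist(), cols.tolist())  # → [0, 1] [0, 1]
```

The huge replacement cost makes the NaN position effectively unassignable, so the matcher falls back to the finite entries.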

LiWentomng commented 1 year ago

Hello @zhaoyangwei123. A large loss is normal for box2mask at the start of training. I have uploaded my training log file for COCO (R-101); you can refer to it.

I didn't encounter the problem above; it seems to be a problem with the assigner. Are you training on COCO or on your own dataset? I have tested the code and configs, and they work normally for COCO and VOC.

zhaoyangwei123 commented 1 year ago

@LiWentomng I am training on COCO with 8 NVIDIA RTX 2080 Ti GPUs, so I changed the image size from (1024, 1024) to (800, 800) with batch=1 and num_workers=0. I don't know whether changing these parameters caused the problem.

LiWentomng commented 1 year ago

@zhaoyangwei123 I suggest you first try VOC on the RTX 2080 Ti GPU; VOC needs less GPU memory and less training time. The VOC link with COCO-format annotations is here.

I guess that batch_size=1 may cause this problem. I will look into it.

LiWentomng commented 1 year ago

@zhaoyangwei123 I have fixed this issue: with batch_size=1, the loss values could become NaN. You can try the current code. Please note that when batch_size=1, the learning rate lr, the training step, and max_iters (50e by default) need to be scaled proportionally. Any further questions can be discussed here.
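The proportional scaling mentioned here follows the usual linear scaling rule: when the effective batch size shrinks, scale the learning rate down and the iteration count up by the same factor so the model still sees roughly the same number of images. A quick sketch with assumed baseline numbers (the batch size of 16 and lr of 1e-4 are illustrative, not the repo's exact defaults):

```python
# Hypothetical baseline schedule (illustrative values, not the repo defaults).
base_batch = 16          # e.g. 8 GPUs x 2 images per GPU
base_lr = 1e-4
base_max_iters = 368750  # iteration count seen in the log above

new_batch = 1            # samples_per_gpu=1 on a single GPU
scale = new_batch / base_batch

# Linear scaling rule: lr shrinks with the batch, iterations grow,
# keeping total images seen (batch * iters) roughly constant.
new_lr = base_lr * scale
new_max_iters = int(base_max_iters / scale)

print(new_lr, new_max_iters)  # → 6.25e-06 5900000
```

The lr step milestones in the schedule would be stretched by the same factor as max_iters.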

zhaoyangwei123 commented 1 year ago

@LiWentomng Thank you very much for your answer, but when I run your new code, I get the following error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/transforms.py", line 767, in __init__
    assert crop_size[0] > 0 and crop_size[1] > 0
TypeError: '>' not supported between instances of 'tuple' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/custom.py", line 129, in __init__
    self.pipeline = Compose(pipeline)
  File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/compose.py", line 23, in __init__
    transform = build_from_cfg(transform, PIPELINES)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: RandomCrop: '>' not supported between instances of 'tuple' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 242, in <module>
    main()
  File "tools/train.py", line 218, in main
    datasets = [build_dataset(cfg.data.train)]
  File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/builder.py", line 82, in build_dataset
    dataset = build_from_cfg(cfg, DATASETS, default_args)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: CocoDataset: RandomCrop: '>' not supported between instances of 'tuple' and 'int'
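The TypeError itself can be reproduced with plain Python. The assertion in RandomCrop.__init__ compares crop_size[0] with an int, so if crop_size is accidentally nested one level too deep (a tuple inside a tuple, e.g. from an extra pair of parentheses in the config), crop_size[0] is itself a tuple and the comparison fails. A minimal sketch (check_crop_size is a hypothetical stand-in for the real assertion):

```python
def check_crop_size(crop_size):
    # Mirrors the assertion reported at transforms.py line 767.
    assert crop_size[0] > 0 and crop_size[1] > 0

check_crop_size((800, 800))  # fine: ints compared with 0

try:
    check_crop_size(((800, 800),))  # nested tuple: crop_size[0] is a tuple
except TypeError as e:
    print(e)  # '>' not supported between instances of 'tuple' and 'int'
```

So the error points at how crop_size is constructed in the config, not at the crop logic itself.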

I verified boxlevelset and boxinst, and both work fine, so I think there may be some errors in the box2mask code you uploaded.

LiWentomng commented 1 year ago

@zhaoyangwei123 When did this error appear, at the start or during the training process?
I have tested the code and config with 800x800 and bs=1, and training works fine. Given the reported error, is the image size in the correct format, i.e. image_size = (800, 800), in your config? Can you share your config?

zhaoyangwei123 commented 1 year ago

@LiWentomng Hello, my error came at the beginning of training, and I have the following config: image_size = (1024, 1024), samples_per_gpu=1, workers_per_gpu=0, lr=0.00005; the rest of the configuration is unchanged. Because the error also appeared on multiple GPUs, I decided to debug on a single GPU first. On a single 2080 Ti, the image size was kept unchanged. I located the error at line 767 of transforms.py.
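For reference, the relevant part of an mmdet-style config would look roughly like this (a hypothetical fragment for illustration, not the repo's actual file; only the values quoted above are from this thread, and the nested-tuple line shows an assumed way the TypeError could be triggered):

```python
# Hypothetical mmdet-style config fragment (illustrative, not the repo's file).
image_size = (1024, 1024)  # must be a flat (h, w) tuple of ints

train_pipeline = [
    # ...
    dict(type='RandomCrop', crop_size=image_size),       # OK: crop_size[0] is an int
    # dict(type='RandomCrop', crop_size=(image_size,))   # nested tuple -> TypeError
]

data = dict(
    samples_per_gpu=1,   # batch size per GPU, as in the setup above
    workers_per_gpu=0,
)
```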