ArchipLab-LinfengZhang / Object-Detection-Knowledge-Distillation-ICLR2021

The official implementation of ICLR2021 paper "Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors".

weight mismatch #10

Closed Senwang98 closed 2 years ago

Senwang98 commented 2 years ago

@ArchipLab-LinfengZhang Hi, when using the code you provide, I meet a weight-loading mismatch problem. Since the URL for x101-32d.pth is unavailable, I just downloaded cascade_mask_rcnn_x101_32x4d_fpn_dconv_c3-c5_1x_coco-e75f90c8. The teacher's backbone is set to have no pretrained weight.

_base_ = './cascade_mask_rcnn_r50_fpn_1x_coco.py'
model = dict(
    # pretrained='open-mmlab://resnext101_32x4d',
    backbone=dict(
        type='ResNeXt',
        depth=101,
        groups=32,
        base_width=4,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        style='pytorch'))

build_teacher() function:

def build_teacher():
    teacher_cfg = Config.fromfile("configs/dcn/cascade_mask_rcnn_x101_32x4d_fpn_dconv_c3-c5_1x_coco.py")
    teacher = build_detector(
        teacher_cfg.model, train_cfg=teacher_cfg.train_cfg, test_cfg=teacher_cfg.test_cfg)
    load_checkpoint(teacher,
                    # "mmdetection/checkpoints/cascade_mask_rcnn_x101_32x4d_fpn_dconv_c3-c5_1x_c.pth",
                    "chechpoints/cascade_mask_rcnn_x101_32x4d_fpn_dconv_c3-c5_1x_coco-e75f90c8.pth",
                    map_location='cpu')
    return teacher
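
For reference, a minimal usage sketch (my own, not from the repo) of how the returned teacher is typically handled in distillation: switched to eval mode and frozen so that only the student receives gradient updates.

teacher = build_teacher()
teacher.eval()                       # keep BN statistics fixed, disable dropout
for p in teacher.parameters():
    p.requires_grad = False          # the teacher only provides targets, it is never updated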

But I think I have replaced everything you mentioned, and I still get the teacher's weight mismatch problem, as follows:

2021-12-10 20:50:37,810 - mmdet - INFO - load model from: torchvision://resnet50
2021-12-10 20:50:37,940 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for layer1.0.conv1.weight: copying a param with shape torch.Size([64, 64, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 64, 1, 1]).
size mismatch for layer1.0.bn1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer1.0.bn1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer1.0.bn1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).

Have you met a similar problem?
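
A hedged debugging sketch (my own, assuming the checkpoint file named above and a `teacher` model already built) that prints exactly which parameters differ in shape between the checkpoint and the model, which helps tell whether the mismatch comes from the teacher or the student side:

import torch

ckpt = torch.load(
    "checkpoints/cascade_mask_rcnn_x101_32x4d_fpn_dconv_c3-c5_1x_coco-e75f90c8.pth",
    map_location='cpu')
state = ckpt.get('state_dict', ckpt)      # mmdet checkpoints keep weights under 'state_dict'
model_state = teacher.state_dict()        # swap in the student model to inspect it instead
for name, param in state.items():
    if name in model_state and param.shape != model_state[name].shape:
        print(name, tuple(param.shape), '->', tuple(model_state[name].shape))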

Shawnnnnn commented 2 years ago

I have been debugging for several days and tried several mmdet versions, but none of them run. Model loading is fine now, but the teacher from build_teacher also has to be wrapped in MMDistributedDataParallel, otherwise it complains that the parameter list(Tensor) does not match the DataContainer; after doing that, however, I get TypeError: kd_feat_loss is not a tensor or list of tensors.
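
For context, a minimal sketch of that wrapping (my own, assuming a single-GPU process in an already initialised torch.distributed setup; the exact device handling is an assumption, not from the repo):

import torch
from mmcv.parallel import MMDistributedDataParallel

teacher = build_teacher().cuda()
teacher = MMDistributedDataParallel(
    teacher,
    device_ids=[torch.cuda.current_device()],   # requires torch.distributed to be initialised
    broadcast_buffers=False)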

Shawnnnnn commented 2 years ago

losses: {'kd_feat_loss': 0, 'kd_channel_loss': 0, 'kd_spatial_loss': 0, 'kd_nonlocal_loss': 0.0, 'loss_rpn_cls': [tensor(0.4318, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.1646, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.0469, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.0213, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.0296, device='cuda:0', grad_fn=<MulBackward0>)], 'loss_rpn_bbox': [tensor(0.0399, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.0577, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.0251, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.0113, device='cuda:0', grad_fn=<MulBackward0>), tensor(0.0429, device='cuda:0', grad_fn=<MulBackward0>)], 'loss_cls': tensor(4.4297, device='cuda:0', grad_fn=<MulBackward0>), 'acc': tensor([0.], device='cuda:0'), 'loss_bbox': tensor(0.0252, device='cuda:0', grad_fn=<MulBackward0>)}

The KD losses he returns are plain numbers, but self._parse_losses(losses) in mmdet can only parse a Tensor or a list of Tensors.
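
One possible workaround (my own sketch, not the repo's code) is to convert any plain-number entries in the loss dict into tensors before they reach _parse_losses:

import torch

def tensorize_losses(losses, device='cuda'):
    # mmdet's _parse_losses accepts only Tensors or lists of Tensors,
    # so wrap plain int/float KD loss values (e.g. kd_feat_loss) first.
    return {name: torch.as_tensor(float(value), device=device)
            if isinstance(value, (int, float)) else value
            for name, value in losses.items()}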

Senwang98 commented 2 years ago

@Shawnnnnn I would suggest re-implementing this ICLR paper on top of CWD instead. The idea is simple anyway, so there is no need to get stuck on the code details. Besides, this work is not that clean and its numbers do not look great by today's standards, so skipping it will not hurt your grasp of the detection-KD field.