924973292 / TOP-ReID

【AAAI2024】TOP-ReID: Multi-spectral Object Re-Identification with Token Permutation

Replace the backbone with CLIP #7

Closed · Betricy closed this 3 months ago

Betricy commented 3 months ago

Hello, I replaced the ViT feature-extraction part of your model with CLIP and then trained it, but the mAP dropped by about half. Could you help me figure out what is wrong?

```python
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_

# weights_init_classifier, weights_init_kaiming, load_clip_to_cpu,
# PromptLearner, TextEncoder, TPM, and CRM come from the
# TOP-ReID / CLIP-ReID codebases.


class build_transformer(nn.Module):
    def __init__(self, num_classes, cfg, camera_num, view_num):
        super(build_transformer, self).__init__()
        self.model_name = 'ViT-B-16'
        self.neck = cfg.MODEL.NECK
        self.neck_feat = cfg.TEST.NECK_FEAT
        self.in_planes = 768
        self.in_planes_proj = 512
        self.num_classes = num_classes
        self.camera_num = camera_num
        self.view_num = view_num
        self.sie_coe = cfg.MODEL.SIE_COE

        self.ID_LOSS_TYPE = cfg.MODEL.ID_LOSS_TYPE
        self.norm = nn.LayerNorm(self.in_planes)
        self.classifier = nn.Linear(self.in_planes, self.num_classes, bias=False)
        self.classifier.apply(weights_init_classifier)
        self.classifier_proj = nn.Linear(self.in_planes_proj, self.num_classes, bias=False)
        self.classifier_proj.apply(weights_init_classifier)

        self.bottleneck = nn.BatchNorm1d(self.in_planes)
        self.bottleneck.bias.requires_grad_(False)
        self.bottleneck.apply(weights_init_kaiming)
        self.bottleneck_proj = nn.BatchNorm1d(self.in_planes_proj)
        self.bottleneck_proj.bias.requires_grad_(False)
        self.bottleneck_proj.apply(weights_init_kaiming)

        self.h_resolution = int((cfg.INPUT.SIZE_TRAIN[0] - 16) // cfg.MODEL.STRIDE_SIZE[0] + 1)
        self.w_resolution = int((cfg.INPUT.SIZE_TRAIN[1] - 16) // cfg.MODEL.STRIDE_SIZE[1] + 1)
        self.vision_stride_size = cfg.MODEL.STRIDE_SIZE[0]
        clip_model = load_clip_to_cpu(self.model_name, self.h_resolution,
                                      self.w_resolution, self.vision_stride_size)
        clip_model.to("cuda")

        self.image_encoder = clip_model.visual

        if cfg.MODEL.SIE_CAMERA:
            self.cv_embed = nn.Parameter(torch.zeros(camera_num, self.in_planes))
            trunc_normal_(self.cv_embed, std=.02)
            print('camera number is : {}'.format(camera_num))

        dataset_name = cfg.DATASETS.NAMES
        self.prompt_learner = PromptLearner(num_classes, dataset_name,
                                            clip_model.dtype, clip_model.token_embedding)
        self.text_encoder = TextEncoder(clip_model)

    def forward(self, x=None, label=None, cam_label=None, view_label=None):
        if self.model_name == 'ViT-B-16':
            if cam_label is not None:
                cv_embed = self.sie_coe * self.cv_embed[cam_label]  # (64, 768)
            else:
                cv_embed = None
            image_features_last, image_features, image_features_proj = self.image_encoder(x, cv_embed)
            image_features = self.norm(image_features)
            # img_feature_last = image_features_last[:, 0]
            img_feature = image_features[:, 0]
            # img_feature_proj = image_features_proj[:, 0]

        feat = self.bottleneck(img_feature)
        # feat_proj = self.bottleneck_proj(img_feature_proj)
        if self.training:
            if self.ID_LOSS_TYPE in ('arcface', 'cosface', 'amsoftmax', 'circle'):
                cls_score = self.classifier(feat, label)
            else:
                cls_score = self.classifier(feat)
            return image_features, cls_score, img_feature  # global feature for triplet loss
        else:
            if self.neck_feat == 'after':
                return image_features, feat
            else:
                return image_features, img_feature

    def load_param(self, trained_path):
        param_dict = torch.load(trained_path)
        for i in param_dict:
            self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
        print('Loading pretrained model from {}'.format(trained_path))

    def load_param_finetune(self, model_path):
        param_dict = torch.load(model_path)
        for i in param_dict:
            self.state_dict()[i].copy_(param_dict[i])
        print('Loading pretrained model for finetuning from {}'.format(model_path))


class clip_TOPReID(nn.Module):
    def __init__(self, num_classes, camera_num, view_num, cfg):
        super(clip_TOPReID, self).__init__()

        # one CLIP branch per modality: near-infrared, thermal, RGB
        self.NI = build_transformer(num_classes, cfg, camera_num, view_num)
        self.TI = build_transformer(num_classes, cfg, camera_num, view_num)
        self.RGB = build_transformer(num_classes, cfg, camera_num, view_num)

        self.num_classes = num_classes
        self.cfg = cfg
        self.camera = camera_num
        self.view = view_num
        self.num_head = 12
        self.mix_dim = 768

        self.TPM = TPM(dim=self.mix_dim, num_heads=self.num_head)
        self.re = cfg.MODEL.RE
        if self.re:
            # reconstruction module for missing modalities
            self.CRM = CRM(dim=self.mix_dim, num_heads=self.num_head, miss=cfg.TEST.MISS,
                           depth=cfg.MODEL.RE_LAYER)
        self.neck = cfg.MODEL.NECK
        self.neck_feat = cfg.TEST.NECK_FEAT
        self.ID_LOSS_TYPE = cfg.MODEL.ID_LOSS_TYPE
        self.layer = cfg.MODEL.LAYER
        self.direct = cfg.MODEL.DIRECT

        self.classifier_TPM = nn.Linear(3 * self.mix_dim, self.num_classes, bias=False)
        self.classifier_TPM.apply(weights_init_classifier)
        self.bottleneck_TPM = nn.BatchNorm1d(3 * self.mix_dim)
        self.bottleneck_TPM.bias.requires_grad_(False)
        self.bottleneck_TPM.apply(weights_init_kaiming)

        self.classifier_ViT = nn.Linear(3 * self.mix_dim, self.num_classes, bias=False)
        self.classifier_ViT.apply(weights_init_classifier)
        self.bottleneck_ViT = nn.BatchNorm1d(3 * self.mix_dim)
        self.bottleneck_ViT.bias.requires_grad_(False)
        self.bottleneck_ViT.apply(weights_init_kaiming)

        self.miss = cfg.TEST.MISS

    def load_param(self, trained_path):
        param_dict = torch.load(trained_path)
        for i in param_dict:
            self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
        print('Loading pretrained model from {}'.format(trained_path))

    def forward(self, x, label=None, cam_label=None, view_label=None):
        if self.training:
            RGB = x['RGB']
            NI = x['NI']
            TI = x['TI']
            NI_cash, NI_score, NI_global = self.NI(NI, cam_label=cam_label, view_label=view_label)
            TI_cash, TI_score, TI_global = self.TI(TI, cam_label=cam_label, view_label=view_label)
            RGB_cash, RGB_score, RGB_global = self.RGB(RGB, cam_label=cam_label, view_label=view_label)

            ori = torch.cat([RGB_global, NI_global, TI_global], dim=-1)  # (64, 2304)
            ori_global = self.bottleneck_ViT(ori)
            ori_score = self.classifier_ViT(ori_global)

            # TPM_feature = self.TPM(RGB_cash[self.layer], NI_cash[self.layer], TI_cash[self.layer])
            # concatenation of the three modalities after token permutation
            TPM_feature = self.TPM(RGB_cash, NI_cash, TI_cash)
            if self.re:
                # loss_re = self.CRM(RGB_cash[self.layer], NI_cash[self.layer], TI_cash[self.layer])
                loss_re = self.CRM(RGB_cash, NI_cash, TI_cash)
            TPM_global = self.bottleneck_TPM(TPM_feature)
            TPM_score = self.classifier_TPM(TPM_global)
            if self.re:
                if self.direct:
                    return TPM_score, TPM_feature, ori_score, ori, loss_re
                else:
                    return TPM_score, TPM_feature, RGB_score, RGB_global, NI_score, NI_global, TI_score, TI_global, loss_re
            else:
                if self.direct:
                    return TPM_score, TPM_feature, ori_score, ori
                else:
                    return TPM_score, TPM_feature, RGB_score, RGB_global, NI_score, NI_global, TI_score, TI_global

        else:
            RGB = x['RGB']
            NI = x['NI']
            TI = x['TI']
            NI_cash, NI_global = self.NI(NI, cam_label=cam_label, view_label=view_label)
            TI_cash, TI_global = self.TI(TI, cam_label=cam_label, view_label=view_label)
            RGB_cash, RGB_global = self.RGB(RGB, cam_label=cam_label, view_label=view_label)
            TPM_feature = self.TPM(RGB_cash, NI_cash, TI_cash)
            if self.re:
                # reconstruct whichever modalities are missing at test time
                if self.miss == 'r':
                    RGB = self.CRM(ma=None, mb=NI_cash[self.layer], mc=TI_cash[self.layer])
                    TPM_feature = self.TPM(RGB, NI_cash[self.layer], TI_cash[self.layer])
                elif self.miss == 'n':
                    NI = self.CRM(ma=RGB_cash[self.layer], mb=None, mc=TI_cash[self.layer])
                    TPM_feature = self.TPM(RGB_cash[self.layer], NI, TI_cash[self.layer])
                elif self.miss == 't':
                    TI = self.CRM(ma=RGB_cash[self.layer], mb=NI_cash[self.layer], mc=None)
                    TPM_feature = self.TPM(RGB_cash[self.layer], NI_cash[self.layer], TI)
                elif self.miss == 'rn':
                    RGB, NI = self.CRM(ma=None, mb=None, mc=TI_cash[self.layer])
                    TPM_feature = self.TPM(RGB, NI, TI_cash[self.layer])
                elif self.miss == 'rt':
                    RGB, TI = self.CRM(ma=None, mb=NI_cash[self.layer], mc=None)
                    TPM_feature = self.TPM(RGB, NI_cash[self.layer], TI)
                elif self.miss == 'nt':
                    NI, TI = self.CRM(ma=RGB_cash[self.layer], mb=None, mc=None)
                    TPM_feature = self.TPM(RGB_cash[self.layer], NI, TI)

            TPM_global = self.bottleneck_TPM(TPM_feature)
            if self.neck_feat == 'after':
                pass
            else:
                TPM_global = TPM_feature
            return torch.cat([TPM_global], dim=-1)
```

924973292 commented 3 months ago

It is likely an issue with the optimizer and learning rate. The pre-trained CLIP parameters cannot withstand the high learning rate used in TOP-ReID. You can refer to the settings in CLIP-ReID and switch to the configuration it uses in the second fine-tuning stage.

[screenshot: CLIP-ReID second-stage fine-tuning settings]

This might help!
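
For concreteness, a minimal sketch of that kind of second-stage configuration, written as yacs overrides (the key names follow the TransReID-style config that TOP-ReID builds on; apart from the 5e-6 backbone learning rate discussed later in this thread, the values are placeholders, not CLIP-ReID's exact numbers):

```python
# Hypothetical second-stage fine-tuning settings in the spirit of CLIP-ReID.
# Key names assume a TransReID-style yacs config; values other than the
# 0.000005 base LR are placeholders, not CLIP-ReID's exact numbers.
cfg.SOLVER.OPTIMIZER_NAME = "Adam"   # CLIP fine-tuning typically uses Adam
cfg.SOLVER.BASE_LR = 0.000005        # far below TOP-ReID's default LR
cfg.SOLVER.WEIGHT_DECAY = 0.0001
cfg.SOLVER.WARMUP_METHOD = "linear"
```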

Betricy commented 3 months ago


The 40% mAP was obtained after I had already lowered the learning rate to 0.00035, and convergence is very slow at such a small learning rate.

924973292 commented 3 months ago

What is your batch size? You can adjust it to `batch_size: 64` and `num_instance: 4` (see the sketch below). Give it a try.
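
As config overrides this would be roughly the following (key names are assumed from the TransReID-style config; only the values come from this thread):

```python
# Suggested batch settings as yacs overrides. Key names are assumptions
# based on TransReID-style configs; the values are from this thread.
cfg.SOLVER.IMS_PER_BATCH = 64      # total images per batch
cfg.DATALOADER.NUM_INSTANCE = 4    # images per identity in each PK batch
```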

Betricy commented 3 months ago

I'm on a single 3090 with `batch_size: 64` and `num_instance: 8`. I'll give `num_instance: 4` a try.

924973292 commented 3 months ago

Additionally, you should note whether the learning rate you set applies only to the backbone or to the entire network. For the CLIP backbone you need to reduce the learning rate, while for CRM and TPM the learning rate should be kept at the normal value, otherwise convergence will be very slow!

924973292 commented 3 months ago

I think you can set the backbone learning rate to 0.000005 and use 0.00035 for the other parts.

Betricy commented 3 months ago


How do I set layer-wise learning rates? Also, I noticed that the three modalities share the backbone. How would separate backbones perform? Have you tried that?

924973292 commented 3 months ago
[screenshot: setting per-parameter learning rates]

1. You can set different learning rates according to the parameter name, for example:

   `if "clip..." in key: lr = 0.000005 else: ...`

2. Additionally, the results in TOP-ReID are based on the three modalities not sharing the backbone. I also tried sharing the backbone; the results are at the bottom of the TOP-ReID README and show that the impact is minimal.

Betricy commented 3 months ago


Sorry, I misread the code and thought the backbone was shared. Thanks!