Error when using test.py to run the best trained model the author uploaded yesterday

hurryup186 commented 7 months ago

Hello. When I run the Trained models you uploaded yesterday to reproduce the best reported metrics, I get a similar error on both SYSU and RegDB. Could you tell me what is going wrong?

    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for build_model:
    size mismatch for prompt_learner.cls_ctx_rgb: copying a param with shape torch.Size([206, 4, 512]) from checkpoint, the shape in current model is torch.Size([754, 4, 512]).
    size mismatch for prompt_learner.cls_ctx_ir: copying a param with shape torch.Size([206, 4, 512]) from checkpoint, the shape in current model is torch.Size([486, 4, 512]).

LqhThird commented 7 months ago

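I ran into the same problem. In addition, the metrics during the prepare training stage are rather low: after 50 epochs, mAP has still not reached 6 points. The only change I made was to add multi-GPU parallelism by inserting the following code at the start of the do_train_stage function. I am not sure whether this causes a problem:

    if device:
        model.to(0)
        if torch.cuda.device_count() > 1:
            print('Using {} GPUs for training'.format(torch.cuda.device_count()))
            model = nn.DataParallel(model)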

LqhThird commented 7 months ago

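The cause is that the model stores text_feature. In the prepare stage this dimension seems to be fairly small (206), but when the model is built for loading, it uses the number of pseudo-label IDs, so the visible and infrared dimensions are 754 and 486 respectively. As a result, test.py cannot load the parameters from model_prepare_sysu.pth; if you swap in the fully trained parameters, loading works. My problem now is that even when I load the fully trained best_sysu.pth, I still cannot get good results from test.py. I am not sure whether I got one of the steps wrong.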

CzAngus commented 7 months ago

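The code has problems when run with multi-GPU DP, most likely in the gradient backward pass that involves the two memories. You could try DDP instead. All of our experiments used a single GPU.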

hurryup186 commented 7 months ago

Hello, I am using a single GPU, but the problem raised in this issue still occurs.

LqhThird commented 7 months ago

The results are also poor when I test on a single GPU with the command below, and I am not sure why:

    CUDA_VISIBLE_DEVICES=0 python test.py --dataset 'sysu' --resume_path save/checkpoints/model_best_sysu.pth

CzAngus commented 7 months ago

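@LqhThird, sorry. I may have written the wrong model link by mistake when I uploaded the files earlier. I just checked and re-uploaded the best SYSU model, so please download it and test again. Thanks for pointing this out!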

CzAngus commented 7 months ago

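@zhangyifeng186, please refer to @LqhThird's answer. The prepare stage and its model are meant to train a model that clusters RGB and IR well, so that we can generate text descriptions for each person more accurately. If you want to test model_perpare_sysu.pth, add the following before the model is built, that is, before model = build_model(args, n_color_class, n_thermal_class):

    n_color_class = 395
    n_thermal_class = 395

This fixes the number of text descriptions, so loading the model no longer raises the error you describe.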
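Rather than hard-coding the two counts, they could in principle be read from the checkpoint itself, since the first dimension of each prompt tensor in the traceback is the class count. This is a sketch under assumptions, not the author's method: `infer_class_counts` is a hypothetical helper, and the 395-row stand-in shapes are assumed from the values suggested above rather than read from the real file.

```python
from types import SimpleNamespace

# Parameter names taken from the error message in this issue.
RGB_KEY = "prompt_learner.cls_ctx_rgb"
IR_KEY = "prompt_learner.cls_ctx_ir"


def infer_class_counts(ckpt_state):
    """Return (n_color_class, n_thermal_class), read from the first
    dimension of the two learned prompt tensors in the checkpoint."""
    n_color_class = ckpt_state[RGB_KEY].shape[0]
    n_thermal_class = ckpt_state[IR_KEY].shape[0]
    return n_color_class, n_thermal_class


# Hypothetical stand-in for the prepare checkpoint (assumed shapes), so the
# sketch runs without PyTorch or the actual .pth file.
fake_prepare_ckpt = {
    RGB_KEY: SimpleNamespace(shape=(395, 4, 512)),
    IR_KEY: SimpleNamespace(shape=(395, 4, 512)),
}

n_color_class, n_thermal_class = infer_class_counts(fake_prepare_ckpt)
print(n_color_class, n_thermal_class)
```

In test.py the two values would then be passed to build_model(args, n_color_class, n_thermal_class) as in the suggestion above. As the rest of this thread shows, this override is only relevant to the prepare checkpoint, not to the best models.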

LqhThird commented 7 months ago

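Sorry, I think the link may still be wrong. After switching to prepare_sysu.pth and setting n_color_class = 395 and n_thermal_class = 395, I get normal results, which shows that the dataset path is fine. But with model_best_sysu.pth the scores are still very low. I downloaded from the new link two more times and got the same result. Could you please check again?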

CzAngus commented 7 months ago

I tested it and it works. I have uploaded another copy; try this one: https://drive.google.com/file/d/1l8gdLREaPjgPKQE9--h6dJpYil8a-ruU/view?usp=drive_link

hurryup186 commented 7 months ago

Thank you.

hurryup186 commented 7 months ago

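Sorry to bother you again. With the author's guidance, after modifying the two parameters n_color_class and n_thermal_class, both prepare_sysu.pth and the latest model_best_sysu_test.pth reproduce the best results. However, I think the RegDB link may still be wrong. I set n_color_class = 206 and n_thermal_class = 206, but with model_best_regdb_trial1.pth the scores are still very low. I downloaded from the link two more times and got the same result. Could you please check again?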

CzAngus commented 7 months ago

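model_best_regdb_trial1.pth does not need n_color_class = 206. It is the best model, not the prepare model. What you need to do is download the RegDB pseudo labels into the RegDB dataset directory and download the best RegDB model. After that, the following command gives the results:

    CUDA_VISIBLE_DEVICES=0 python test.py --dataset 'regdb' --resume_path 'checkpoints/model_best_regdb_trial1.pth'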

I have verified the download links and they are fine. I will not be replying further.

CzAngus / CCLNet

Error when using test.py to run the best trained model the author uploaded yesterday #6