Xuchen-Li closed this issue 7 months ago.
The SE module in ecm.py is a squeeze-and-excitation module, and the module itself is fine; the problem is most likely with those 512 channels. Did the error appear only after a full epoch of training had finished, or did it crash during the first epoch? Could you send me the complete error message?
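For reference, a squeeze-and-excitation block generally has the shape shown in the minimal sketch below (this is only an illustration, not the exact SE code in ecm.py; the 64 channels and the reduction ratio of 4 are placeholder values). The key point is that fc1 is a 1x1 convolution whose in_channels is fixed when the module is constructed, so it can only accept inputs with exactly that many channels:

import torch
import torch.nn as nn

class SE(nn.Module):
    """Minimal squeeze-and-excitation sketch (illustrative only, not the ecm.py module)."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                                   # squeeze: B x C x 1 x 1
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # in_channels fixed at build time
        self.act = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        w = self.gate(self.fc2(self.act(self.fc1(self.pool(x)))))  # excitation: per-channel weights
        return x * w                                                # re-weight the input channels

se = SE(channels=64)
out = se(torch.randn(32, 64, 8, 8))  # fine: the input has the 64 channels fc1 was built for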
Hello, the error appears only after the first epoch of training has finished, and at that point the weight file has not been saved yet. The full error message is below. Thank you very much!

[train: 1, 1875 / 1875] FPS: 0.8 (0.7) , DataTime: 1.190 (0.005) , ForwardTime: 0.085 , TotalTime: 1.279 , Loss/total: 1.51487 , Loss/giou: 0.30231 , Loss/l1: 0.03324 , Loss/location: 0.74408 , mean_IoU: 0.69769
Epoch Time: 0:39:58.954812
Avg Data Time: 1.19018
Avg GPU Trans Time: 0.00461
Avg Forward Time: 0.08465
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "E:\LightFC-main\lib\train../..\lib\train\trainers\base_trainer.py", line 92, in train
    self.train_epoch()
  File "E:\LightFC-main\lib\train../..\lib\train\trainers\ltr_trainer.py", line 139, in train_epoch
    self.cycle_dataset(loader)  # run the training or validation cycle for the data loader
  File "E:\LightFC-main\lib\train../..\lib\train\trainers\ltr_trainer.py", line 90, in cycle_dataset
    loss, stats = self.actor(data)
  File "E:\LightFC-main\lib\train../..\lib\train\actors\lightfc.py", line 32, in __call__
    out_dict = self.forward_pass(data)
  File "E:\LightFC-main\lib\train../..\lib\train\actors\lightfc.py", line 51, in forward_pass
    out_dict = self.net(z=template_list, x=search_img)
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\LightFC-main\lib\train../..\lib\models\tracker_model.py", line 46, in forward
    return self.forward_tracking(z, x)
  File "E:\LightFC-main\lib\train../..\lib\models\tracker_model.py", line 57, in forward_tracking
    opt = self.fusion(z_feat, x)  # fuse the processed z_feat and x with the fusion layer
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\LightFC-main\lib\train../..\lib\models\lightfc\fusion\ecm.py", line 71, in forward
    corr = self.ca(self.pw_corr(z, x))
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\LightFC-main\lib\train../..\lib\models\lightfc\fusion\ecm.py", line 34, in forward
    x = self.fc1(x)
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [64, 64, 1, 1], expected input[32, 512, 1, 1] to have 64 channels, but got 512 channels instead
Restarting training from last epoch ... Finished training!
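Looking at the last frame of the traceback: fc1 in ecm.py was built for 64 input channels (weight of size [64, 64, 1, 1]) but receives a 512-channel tensor. The same mismatch can be reproduced in isolation with two lines (the batch size 32 and both channel counts are taken straight from the error message, not from the LightFC code):

import torch
import torch.nn as nn

fc1 = nn.Conv2d(64, 64, kernel_size=1)  # weight of size [64, 64, 1, 1], as in the error
fc1(torch.randn(32, 512, 1, 1))         # RuntimeError: expected input to have 64 channels, but got 512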
I found the problem. I had previously modified parts of the model to make deployment easier, and as a result the raw image, rather than the extracted features, was being passed into the ECM module during evaluation. The fix is as follows:
class LightFC(nn.Module):
    def __init__(self, cfg, env_num=0, training=False):
        super(LightFC, self).__init__()

        if cfg.MODEL.BACKBONE.TYPE == 'MobileNetV2':
            self.backbone = MobileNetV2()
        elif cfg.MODEL.BACKBONE.TYPE == 'tiny_vit_5m_224':
            self.backbone = tiny_vit_5m_224()

        self.training = training

        if self.train:
            load_pretrain(self.backbone, env_num=env_num, training=training, cfg=cfg, mode=cfg.MODEL.BACKBONE.LOAD_MODE)

        self.fusion = pwcorr_se_scf_sc_iab_sc_concat(num_kernel=cfg.MODEL.FUSION.PARAMS.num_kernel,
                                                     adj_channel=cfg.MODEL.FUSION.PARAMS.adj_channel
                                                     )

        self.head = repn33_se_center_concat(inplanes=cfg.MODEL.HEAD.PARAMS.inplanes,
                                            channel=cfg.MODEL.HEAD.PARAMS.channel,
                                            feat_sz=cfg.MODEL.HEAD.PARAMS.feat_sz,
                                            stride=cfg.MODEL.HEAD.PARAMS.stride,
                                            freeze_bn=cfg.MODEL.HEAD.PARAMS.freeze_bn,
                                            )

    def forward(self, z, x):
        if self.training:
            z = self.backbone(z)
            x = self.backbone(x)
            opt = self.fusion(z, x)
            out = self.head(opt)
        else:
            z = self.backbone(z)  # this is the place to modify: extract the template features first
            return self.forward_tracking(z, x)
        return out

    def forward_backbone(self, z):
        z = self.backbone(z)
        return z

    def forward_tracking(self, z_feat, x):
        x = self.backbone(x)
        opt = self.fusion(z_feat, x)
        out = self.head(opt)
        return out
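The reason z must go through the backbone here is that, judging from the structure of the quoted code, the template features are meant to be extracted once (via forward_backbone) and then reused for every search frame in forward_tracking, so the fusion module always expects backbone features rather than a raw image. The toy sketch below illustrates that two-path pattern with stand-in modules (the Conv2d backbone, the 128/256 crop sizes, and the channel counts are made-up placeholders, not LightFC's actual components):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTracker(nn.Module):
    """Toy illustration of the template-caching pattern, not the LightFC modules."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 96, kernel_size=3, stride=2, padding=1)  # stand-in backbone
        self.fusion = nn.Conv2d(192, 96, kernel_size=1)                       # stand-in fusion
        self.head = nn.Conv2d(96, 4, kernel_size=1)                           # stand-in head

    def forward_backbone(self, z):
        return self.backbone(z)  # template features, computed once

    def forward_tracking(self, z_feat, x):
        x_feat = self.backbone(x)  # search features, computed every frame
        z_feat = F.interpolate(z_feat, size=x_feat.shape[-2:])  # naive "fusion" via resize + concat
        return self.head(self.fusion(torch.cat([z_feat, x_feat], dim=1)))

tracker = ToyTracker().eval()
z_feat = tracker.forward_backbone(torch.randn(1, 3, 128, 128))  # once, when the tracker is initialized
for _ in range(3):                                               # then for every incoming frame
    out = tracker.forward_tracking(z_feat, torch.randn(1, 3, 256, 256))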
Thank you very much!
Training works normally now. Thanks again!
Hello, thank you for your wonderful work! When I train the model with the command "python tracking/train.py --script LightFC --config mobilnetv2_p_pwcorr_se_scf_sc_iab_sc_adj_concat_repn33_se_conv33_center_wiou --save_dir . --mode multiple --nproc_per_node 2", training crashes after 1 epoch with "RuntimeError: Given groups=1, weight of size [64, 64, 1, 1], expected input[32, 512, 1, 1] to have 64 channels, but got 512 channels instead". It seems that in ecm.py, line 34, x = self.fc1(x), the shapes do not match.