LiYunfengLYF / LightFC

MIT License

Problem about training. #5

Closed Xuchen-Li closed 7 months ago

Xuchen-Li commented 7 months ago

Hello, thank you for your wonderful work! When I train the model with the command "python tracking/train.py --script LightFC --config mobilnetv2_p_pwcorr_se_scf_sc_iab_sc_adj_concat_repn33_se_conv33_center_wiou --save_dir . --mode multiple --nproc_per_node 2", training fails after one epoch with the error "RuntimeError: Given groups=1, weight of size [64, 64, 1, 1], expected input[32, 512, 1, 1] to have 64 channels, but got 512 channels instead". The failure appears to be at line 34 of ecm.py, x = self.fc1(x), where the shapes do not match.

LiYunfengLYF commented 7 months ago

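The SE module in ecm.py is a squeeze-and-excitation module, and the module itself is fine; the problem most likely lies in where the 512 channels come from. Did the error appear only after a full epoch of training had completed, or during the first epoch? Could you send me the complete error output?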
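
For readers following along, the channel check behind the error message can be sketched in plain Python. This is only an illustration of the shape arithmetic, not PyTorch's implementation: the helper name `conv2d_channel_check` is hypothetical, and the two shapes are copied from the error reported above. It assumes the usual conv2d weight layout `(out_channels, in_channels // groups, kH, kW)`.

```python
def conv2d_channel_check(weight_shape, input_shape, groups=1):
    """Return (expected, got) input channels for a 2-D convolution.

    weight_shape: (out_channels, in_channels // groups, kH, kW)
    input_shape:  (batch, channels, height, width)
    """
    expected = weight_shape[1] * groups
    got = input_shape[1]
    return expected, got


# Shapes copied from the error message in this issue.
fc1_weight = (64, 64, 1, 1)  # weight of the 1x1 conv called at ecm.py line 34
fc1_input = (32, 512, 1, 1)  # tensor that actually reached that conv

expected, got = conv2d_channel_check(fc1_weight, fc1_input)
if expected != got:
    print(f"expected {expected} channels, but got {got}")
```

The layer was built for 64 input channels, so the mismatch says nothing is wrong inside the SE block; whatever feeds it produced 512 channels instead of 64.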

Xuchen-Li commented 7 months ago

Hello, the error appeared only after the first epoch of training had finished, before any checkpoint had been saved. The error output is below. Thank you very much!

[train: 1, 1875 / 1875] FPS: 0.8 (0.7) , DataTime: 1.190 (0.005) , ForwardTime: 0.085 , TotalTime: 1.279 , Loss/total: 1.51487 , Loss/giou: 0.30231 , Loss/l1: 0.03324 , Loss/location: 0.74408 , mean_IoU: 0.69769
Epoch Time: 0:39:58.954812
Avg Data Time: 1.19018
Avg GPU Trans Time: 0.00461
Avg Forward Time: 0.08465
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "E:\LightFC-main\lib\train../..\lib\train\trainers\base_trainer.py", line 92, in train
    self.train_epoch()
  File "E:\LightFC-main\lib\train../..\lib\train\trainers\ltr_trainer.py", line 139, in train_epoch
    self.cycle_dataset(loader)  # run a training or validation cycle over the data loader
  File "E:\LightFC-main\lib\train../..\lib\train\trainers\ltr_trainer.py", line 90, in cycle_dataset
    loss, stats = self.actor(data)
  File "E:\LightFC-main\lib\train../..\lib\train\actors\lightfc.py", line 32, in __call__
    out_dict = self.forward_pass(data)
  File "E:\LightFC-main\lib\train../..\lib\train\actors\lightfc.py", line 51, in forward_pass
    out_dict = self.net(z=template_list, x=search_img)
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\LightFC-main\lib\train../..\lib\models\tracker_model.py", line 46, in forward
    return self.forward_tracking(z, x)
  File "E:\LightFC-main\lib\train../..\lib\models\tracker_model.py", line 57, in forward_tracking
    opt = self.fusion(z_feat, x)  # fuse the processed z_feat and x with the fusion layer
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\LightFC-main\lib\train../..\lib\models\lightfc\fusion\ecm.py", line 71, in forward
    corr = self.ca(self.pw_corr(z, x))
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\LightFC-main\lib\train../..\lib\models\lightfc\fusion\ecm.py", line 34, in forward
    x = self.fc1(x)
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "D:\Anaconda3\envs\lightfc\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [64, 64, 1, 1], expected input[32, 512, 1, 1] to have 64 channels, but got 512 channels instead

Restarting training from last epoch ... Finished training!

LiYunfengLYF commented 7 months ago

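I found the problem. I had previously modified the model to make deployment easier, and as a result, during evaluation the ECM module receives the raw image information rather than the extracted features. The specific change is as follows:
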
class LightFC(nn.Module):
    def __init__(self, cfg, env_num=0, training=False):
        super(LightFC, self).__init__()

        if cfg.MODEL.BACKBONE.TYPE == 'MobileNetV2':
            self.backbone = MobileNetV2()
        elif cfg.MODEL.BACKBONE.TYPE == 'tiny_vit_5m_224':
            self.backbone = tiny_vit_5m_224()
        self.training = training
        if self.train:
            load_pretrain(self.backbone, env_num=env_num, training=training, cfg=cfg, mode=cfg.MODEL.BACKBONE.LOAD_MODE)

        self.fusion = pwcorr_se_scf_sc_iab_sc_concat(num_kernel=cfg.MODEL.FUSION.PARAMS.num_kernel,
                                                     adj_channel=cfg.MODEL.FUSION.PARAMS.adj_channel
                                                     )

        self.head = repn33_se_center_concat(inplanes=cfg.MODEL.HEAD.PARAMS.inplanes,
                                            channel=cfg.MODEL.HEAD.PARAMS.channel,
                                            feat_sz=cfg.MODEL.HEAD.PARAMS.feat_sz,
                                            stride=cfg.MODEL.HEAD.PARAMS.stride,
                                            freeze_bn=cfg.MODEL.HEAD.PARAMS.freeze_bn,
                                            )

    def forward(self, z, x):
        if self.training:
            z = self.backbone(z)
            x = self.backbone(x)
            opt = self.fusion(z, x)
            out = self.head(opt)
        else:
            z = self.backbone(z)  # this is the line to change
            return self.forward_tracking(z, x)
        return out

    #
    def forward_backbone(self, z):
        z = self.backbone(z)
        return z

    def forward_tracking(self, z_feat, x):
        x = self.backbone(x)
        opt = self.fusion(z_feat, x)
        out = self.head(opt)
        return out
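
The effect of that change can be illustrated with a shape-only toy, offered as a rough sketch rather than the repository's code. Everything in it is hypothetical: the class `ToyTracker`, the `fixed` flag, the feature width of 96, the stride of 16, and the 128/256 image sizes. It also assumes the marked line, `z = self.backbone(z)` in the non-training branch, is the addition. Shapes are plain tuples, so no real layers run.

```python
class ToyTracker:
    """Shape-only stand-in for the tracker; no real layers are involved."""

    FEAT_CHANNELS = 96  # hypothetical backbone output width

    def __init__(self, fixed):
        self.training = True
        self.fixed = fixed  # True: include the line added in this thread

    def backbone(self, img_shape):
        n, c, h, w = img_shape
        assert c == 3, "backbone expects a 3-channel image"
        return (n, self.FEAT_CHANNELS, h // 16, w // 16)

    def fusion(self, z_feat, x_feat):
        # Fusion is only meaningful on backbone features, not raw images.
        if z_feat[1] != self.FEAT_CHANNELS:
            raise RuntimeError(
                f"template has {z_feat[1]} channels, expected {self.FEAT_CHANNELS}"
            )
        return x_feat

    def forward_tracking(self, z_feat, x):
        return self.fusion(z_feat, self.backbone(x))

    def forward(self, z, x):
        if self.training:
            return self.fusion(self.backbone(z), self.backbone(x))
        if self.fixed:
            z = self.backbone(z)  # the added line
        return self.forward_tracking(z, x)


z, x = (32, 3, 128, 128), (32, 3, 256, 256)  # hypothetical template/search sizes

buggy = ToyTracker(fixed=False)
train_out = buggy.forward(z, x)  # training branch extracts features for z
buggy.training = False
try:
    buggy.forward(z, x)  # non-training branch passes the raw template on
    eval_failed = False
except RuntimeError as err:
    eval_failed = True
    print("non-training branch failed:", err)

patched = ToyTracker(fixed=True)
patched.training = False
eval_out = patched.forward(z, x)
```

In this toy the training branch always works, while the non-training branch fails until the template passes through the backbone. That matches the report above, where the first training epoch completed before the crash.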

Xuchen-Li commented 7 months ago

Thank you very much!

Xuchen-Li commented 7 months ago

Training works normally now. Thanks again!