TsingZ0 / PFLlib

37 traditional FL (tFL) or personalized FL (pFL) algorithms, 3 scenarios, and 20 datasets.
GNU General Public License v2.0

FedCP training error on IoT datasets #177

Closed Joey010 closed 5 months ago

Joey010 commented 5 months ago

When training FedCP on the IoT datasets (HAR and PAMAP2), the following error is raised:

============= Running time: 0th =============
Creating server and clients ...
HARCNN(
  (conv1): Sequential(
    (0): Conv2d(9, 32, kernel_size=(1, 9), stride=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=(1, 2), stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv2): Sequential(
    (0): Conv2d(32, 64, kernel_size=(1, 9), stride=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=(1, 2), stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Sequential(
    (0): Linear(in_features=3712, out_features=1024, bias=True)
    (1): ReLU()
    (2): Linear(in_features=1024, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=12, bias=True)
  )
)

Join ratio / total clients: 1.0 / 9
Finished creating server and clients.

-------------Round number: 0-------------
Traceback (most recent call last):
  File "main.py", line 540, in <module>
    run(args)
  File "main.py", line 358, in run
    server.train()
  File "/Projects/PFLlib/system/flcore/servers/servercp.py", line 100, in train
    self.evaluate()
  File "/Projects/PFLlib/system/flcore/servers/servercp.py", line 78, in evaluate
    stats = self.test_metrics()
  File "/Projects/PFLlib/system/flcore/servers/serverbase.py", line 219, in test_metrics
    ct, ns, auc = c.test_metrics()
  File "/Projects/PFLlib/system/flcore/clients/clientcp.py", line 97, in test_metrics
    output = self.model(x, is_rep=False, context=self.context)
  File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Projects/PFLlib/system/flcore/clients/clientcp.py", line 224, in forward
    gate_in = rep * self.context
RuntimeError: The size of tensor a (3712) must match the size of tensor b (64) at non-singleton dimension 1
TsingZ0 commented 5 months ago

When the HARCNN model is used, the line in_dim = list(args.model.base.parameters())[-1].shape[0] in clientcp.py no longer applies; in_dim has to be set manually, e.g. to in_dim = 1664 or in_dim = 3712, depending on the dataset.

Alternatively, you can adopt the approach from GPFL's clientgpfl.py: self.feature_dim = list(self.model.head.parameters())[0].shape[1]
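
For HARCNN the difference between the two heuristics can be checked in isolation. Below is a minimal, standalone sketch (not repo code): the layer sizes are taken from the model printout above, and the base/head split is an assumption that follows PFLlib's usual convention of treating the whole fc Sequential as the head.

import torch.nn as nn

# HARCNN pieces rebuilt from the printout above (assumed split: conv layers = base, fc = head)
base = nn.Sequential(
    nn.Conv2d(9, 32, kernel_size=(1, 9)), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2), stride=2),
    nn.Conv2d(32, 64, kernel_size=(1, 9)), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2), stride=2),
    nn.Flatten(),
)
head = nn.Sequential(
    nn.Linear(3712, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 12),
)

# Original clientcp.py heuristic: the last parameter of base is the conv2 bias of size 64,
# which is exactly the "tensor b (64)" in the error above.
print(list(base.parameters())[-1].shape[0])   # 64  -> wrong in_dim for HARCNN

# GPFL-style alternative: input dimension of the head's first Linear layer.
print(list(head.parameters())[0].shape[1])    # 3712 -> matches the representation size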

Joey010 commented 5 months ago

After modifying the code as you suggested, a new error appears:

Traceback (most recent call last):
  File "main.py", line 573, in <module>
    run(args)
  File "main.py", line 383, in run
    server.train()
  File "/Projects/FedBML/system/flcore/servers/servercp.py", line 100, in train
    self.evaluate()
  File "/Projects/FedBML/system/flcore/servers/servercp.py", line 78, in evaluate
    stats = self.test_metrics()
  File "/Projects/FedBML/system/flcore/servers/serverbase.py", line 239, in test_metrics
    ct, ns, auc = c.test_metrics()
  File "/Projects/FedBML/system/flcore/clients/clientcp.py", line 99, in test_metrics
    output = self.model(x, is_rep=False, context=self.context)
  File "/home/wangpengju/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Projects/FedBML/system/flcore/clients/clientcp.py", line 226, in forward
    rep_p, rep_g = self.gate(rep, self.tau, self.hard, gate_in, self.flag)
  File "/home/wangpengju/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Projects/FedBML/system/flcore/clients/clientcp.py", line 252, in forward
    pm, gm = self.cs(context, tau=tau, hard=hard)
  File "/home/wangpengju/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Projects/FedBML/system/flcore/servers/servercp.py", line 194, in forward
    x = self.fc(x)
  File "/home/wangpengju/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wangpengju/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/wangpengju/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wangpengju/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (10x3712 and 64x128)
TsingZ0 commented 5 months ago

The original code fails because FedCP's implementation assumes by default that the last FC layer is the head, which does not hold for HARCNN. Since different models are implemented differently, a single fully unified implementation is not possible, so in some experimental settings the code has to be adapted to the specific model.

In general, besides the in_dim change above, the other code that handles the head also needs to be adjusted. For example, in the set_head_g function, headw_p can no longer be obtained directly via head.weight.data.clone(), because the head of HARCNN contains several FC layers (several weight matrices). In that case, torch.matmul has to be used to multiply all of the head's FC weight matrices in order, collapsing them into a single weight matrix that serves as the basis for generating the context.

For reference:

    def set_head_g(self, head):
        headw_ps = []
        # collect every weight matrix in the (possibly multi-layer) head
        for name, mat in self.model.model.head.named_parameters():
            if 'weight' in name:
                headw_ps.append(mat.data)
        # collapse them into a single matrix, multiplying from the last layer backwards
        headw_p = headw_ps[-1]
        for mat in headw_ps[-2::-1]:
            headw_p = torch.matmul(headw_p, mat)
        headw_p.detach_()
        self.context = torch.sum(headw_p, dim=0, keepdim=True)

        for new_param, old_param in zip(head.parameters(), self.model.head_g.parameters()):
            old_param.data = new_param.data.clone()
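
As a quick sanity check of the shapes involved, here is a standalone sketch (not part of the repo code); the HARCNN head sizes come from the model printout above, and the ReLUs are ignored because only the weight matrices are multiplied.

import torch

W1 = torch.randn(1024, 3712)   # fc[0].weight
W2 = torch.randn(512, 1024)    # fc[2].weight
W3 = torch.randn(12, 512)      # fc[4].weight

# Same reverse-order product as in set_head_g above: W3 @ W2 @ W1
headw_p = torch.matmul(torch.matmul(W3, W2), W1)      # shape (12, 3712), one row per class
context = torch.sum(headw_p, dim=0, keepdim=True)     # shape (1, 3712)
print(context.shape)   # matches the 3712-dim representation, so rep * context broadcasts again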

For HARCNN, these two changes are enough to eliminate the errors. The code has been updated accordingly.

Joey010 commented 5 months ago

Thank you very much, this solved all the problems perfectly.