fangwei123456 / spikingjelly

SpikingJelly is an open-source deep learning framework for Spiking Neural Network (SNN) based on PyTorch.
https://spikingjelly.readthedocs.io

A bug in my code #210

Open Loydian opened 2 years ago

Loydian commented 2 years ago

When implementing the SEW block in my own model, I met the problem below.

Traceback (most recent call last):
  File "/home/lyd/spikeformer/spikeformer.py", line 612, in <module>
    loss.backward()
  File "/home/snn/anaconda3/envs/torch17/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/snn/anaconda3/envs/torch17/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!

When debugging it, I found the bug is here:

# Imports assumed for this snippet (clock_driven API of this SpikingJelly version):
import torch.nn as nn
from spikingjelly.clock_driven import layer
from spikingjelly.clock_driven.neuron import MultiStepParametricLIFNode


def conv(in_channels, out_channels, kernel_size=(3, 3), padding=(1, 1), stride=(1, 1), bias=False, bn=True,
         act_layer=MultiStepParametricLIFNode):
    # Conv2d + optional BatchNorm2d wrapped for [T, N, C, H, W] input, followed by
    # a multi-step spiking neuron (or another activation) given by act_layer.
    return nn.Sequential(
        layer.SeqToANNContainer(
            nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=padding, stride=stride, bias=bias),
            nn.BatchNorm2d(out_channels) if bn else nn.Identity()
        ),
        act_layer(detach_reset=True),
        # nn.ReLU(inplace=True)
    )

When I use the spiking neuron, the error occurs, but when I replace it with a ReLU, it works well. The training script also works well with other ANN models. Because of intellectual property concerns, I can't show the complete code. After reading the source code of SpikingJelly, I still can't fix the bug.

fangwei123456 commented 2 years ago

Do you use the latest version of SpikingJelly? And do you move the whole network to CUDA?
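Note that if part of the model is stored in a plain Python list (rather than nn.ModuleList or nn.Sequential), net.to(device) will neither register nor move it. A minimal check, assuming your model object is called net and the target device is cuda:1:

import torch

device = torch.device('cuda:1')
net = net.to(device)
# every registered parameter and buffer should now report cuda:1
for name, p in net.named_parameters():
    assert p.device == device, f'{name} is still on {p.device}'
for name, b in net.named_buffers():
    assert b.device == device, f'{name} is still on {b.device}'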

Loydian commented 2 years ago

Yes, I use version 0.0.0.10. And I am pretty sure the training script is right. When using other models or replacing the act_layer with ReLU, the code works well.

fangwei123456 commented 2 years ago

Replace the nn.ReLU with torch.nn.PReLU and see if it raises an error.
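For example (a quick probe reusing the conv() helper above, the channel sizes are arbitrary): PReLU also carries a learnable weight, so this shows whether any learnable activation fails on cuda:1 or only the spiking neuron.

prelu_block = conv(64, 64, act_layer=lambda **kwargs: nn.PReLU())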

Loydian commented 2 years ago

No, it doesn't. I just tried it, and everything works fine.

fangwei123456 commented 2 years ago

Before loss.backward, print the PLIF's w.device and see if it is cuda:1.
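For example (a minimal sketch, assuming your model object is called net):

for name, p in net.named_parameters():
    print(name, ':', p.shape, p.device)
loss.backward()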

Loydian commented 2 years ago

I have done this before. The parameter w's device is cuda:1:

residual_function.1.1.w : torch.Size([]) cuda:1
shortcut.1.1.w : torch.Size([]) cuda:1

fangwei123456 commented 2 years ago

OK. Can you provide a minimal code example to reproduce the error? You can remove your proposed model from the code to avoid the intellectual property issue.

Loydian commented 2 years ago

When opening this issue, I tried to provide a minimal example, but the rest of the model is a new architecture. After removing the proposed block, the problem somehow disappears. And the original SEW ResNet code works well with the same training script. I compared the implementation of the original SEW ResNet with my convolution part and couldn't find where the bug is.
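For reference, this is a standalone sketch that exercises only the conv() block shown above (assuming the clock_driven API of this version and an input shaped [T, N, C, H, W]); the rest of the architecture is not included:

import torch
from spikingjelly.clock_driven import functional

device = torch.device('cuda:1')
block = conv(64, 64).to(device)                    # the block defined above
x = torch.rand(4, 2, 64, 32, 32, device=device)    # [T, N, C, H, W]
loss = block(x).sum()
loss.backward()                                     # a device mismatch would surface here
functional.reset_net(block)                         # clear neuron states between runs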