fangwei123456 / spikingjelly

SpikingJelly is an open-source deep learning framework for Spiking Neural Networks (SNNs) based on PyTorch.
https://spikingjelly.readthedocs.io

Neurons may cause errors in torch's backward pass #420

Closed Met4physics closed 1 year ago

Met4physics commented 1 year ago

Issue type

SpikingJelly version

latest

Description

I used SpikingJelly to build an SNN version of UNet; my complete code is here. spiking_unet.py is the model that has the problem, test.py is an ANN UNet with the same structure as spiking_unet, and main.py contains the training pipeline.

First, when running spiking_unet I hit the following error, telling me that the backward pass failed:

Traceback (most recent call last):
  File "/root/DRIVE/main.py", line 147, in <module>
    loss.backward()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/lib/python3.8/site-packages/spikingjelly/activation_based/surrogate.py", line 1639, in backward
    return leaky_k_relu_backward(grad_output, ctx.saved_tensors[0], ctx.leak, ctx.k)
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Following the hint, I changed the call to loss.backward(retain_graph=True) and got a new error:

Traceback (most recent call last):
  File "/root/DRIVE/main.py", line 147, in <module>
    loss.backward(retain_graph=True)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Once again following the hint, I wrapped the code in with torch.autograd.set_detect_anomaly(True), which produced yet another error:

/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in CudnnBatchNormBackward0. Traceback of forward call that caused the error:
  File "/root/DRIVE/main.py", line 143, in <module>
    outputs = s_model(inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/DRIVE/spiking_unet.py", line 115, in forward
    x = self.up4(x, x1)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/DRIVE/spiking_unet.py", line 63, in forward
    x = self.conv(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/DRIVE/spiking_unet.py", line 23, in forward
    t = self.c(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/spikingjelly/activation_based/layer.py", line 465, in forward
    return functional.seq_to_ann_forward(x, super().forward)
  File "/root/miniconda3/lib/python3.8/site-packages/spikingjelly/activation_based/functional.py", line 686, in seq_to_ann_forward
    y = stateless_module(y)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 168, in forward
    return F.batch_norm(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2421, in batch_norm
    return torch.batch_norm(
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/root/DRIVE/main.py", line 147, in <module>
    loss.backward(retain_graph=True)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

This is very strange, since there are no in-place operations in my code. I then replaced spiking_unet with an ANN UNet of the same structure, and after running it all the problems disappeared. I suspect that some step inside SpikingJelly goes wrong and causes the in-place operation or the backward-pass error. I also noticed an earlier, similar issue #419, so I am not the only one hitting this.
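For anyone debugging a similar traceback: stateful SpikingJelly neurons keep their membrane potential between forward calls, and that alone is enough to reproduce the first error. A minimal, hypothetical sketch (not the repository code above), assuming the activation_based API shown in the traceback:

```python
import torch
from spikingjelly.activation_based import neuron

lif = neuron.LIFNode()            # stateful: keeps membrane potential in lif.v
x = torch.rand(8, requires_grad=True)

lif(x).sum().backward()           # iteration 1: works, but frees its graph

# Iteration 2: lif.v still carries the autograd history of iteration 1,
# so this backward tries to traverse the already-freed graph and raises
# "Trying to backward through the graph a second time".
lif(x).sum().backward()
```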

Met4physics commented 1 year ago

My bad, I forgot to call functional.reset_net(s_model). Closing this thread.
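For future readers, a minimal sketch of a training loop with the fix applied (s_model, optimizer, criterion and train_loader are assumed names based on the description above). functional.reset_net(s_model) clears every neuron's state, so each backward() only sees the current iteration's graph and retain_graph=True is unnecessary. This also explains the second error: with retain_graph=True, the stale graph still references parameters that optimizer.step() has since modified in place.

```python
from spikingjelly.activation_based import functional

for inputs, targets in train_loader:   # hypothetical data loader
    optimizer.zero_grad()
    outputs = s_model(inputs)          # forward pass through the SNN
    loss = criterion(outputs, targets)
    loss.backward()                    # plain backward, no retain_graph needed
    optimizer.step()
    functional.reset_net(s_model)      # clear neuron states before the next forward
```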

huang-hz commented 4 months ago

> My bad, I forgot to call functional.reset_net(s_model). Closing this thread.

Thanks! I just got started and this had me completely baffled.