WXinlong / SOLO

SOLO and SOLOv2 for instance segmentation, ECCV 2020 & NeurIPS 2020.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #201

Open · Bananavision opened 2 years ago

Bananavision commented 2 years ago

Running a custom dataset on torch v1.5 gives the following error:

Traceback (most recent call last):
  File "/home/usr/project/solov2/SOLO/tools/train.py", line 125, in <module>
    main()
  File "/home/usr/project/solov2/SOLO/tools/train.py", line 115, in main
    train_detector(
  File "/home/usr/project/solov2/SOLO/mmdet/apis/train.py", line 107, in train_detector
    _non_dist_train(
  File "/home/usr/project/solov2/SOLO/mmdet/apis/train.py", line 299, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/runner.py", line 364, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/runner.py", line 275, in train
    self.call_hook('after_train_iter')
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/runner.py", line 231, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/mmcv-0.2.16-py3.9-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py", line 19, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/usr/project/solov2/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/usr/project/solov2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 128, 336, 200]], which is output 0 of ReluBackward0, is at version 3; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Is this error specific to torch >= 1.5? What is the current workaround?

Thanks
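
For reference, here is a minimal standalone sketch, unrelated to SOLO's own code, that triggers the same class of error: the output of a ReLU is saved for its backward pass, so modifying that tensor in place afterwards trips autograd's version check.

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.relu(x)   # ReluBackward0 saves y (its output) for the backward pass
y += 1              # in-place write bumps y's version counter
y.sum().backward()  # RuntimeError: ... output 0 of ReluBackward0 ... modified by an inplace operation
```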

haqishen commented 2 years ago

Hi, I've got the same issue as you.

My env: CUDA 11.3, PyTorch 1.10

I fixed the bug with this modification, please give it a try ;)

#204

IremYoldas commented 2 years ago

I think this error is specific to torch >= 1.5. I got it too. I was already on CUDA 10.1, but downgrading PyTorch to 1.4 fixed it for me. PS: Before downgrading PyTorch I tried many other things (basically everything I could find, for instance changing nn.ReLU(inplace=True) to inplace=False, adding .step() after the ReLU backward, etc.).
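
For anyone trying the same thing, the inplace change mentioned above usually looks like the sketch below. The module is purely illustrative (it is not SOLO's actual code); the point is only where the inplace flag lives.

```python
import torch.nn as nn

class ConvReLU(nn.Module):
    """Illustrative conv + activation block, not SOLO's real module."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # inplace=True overwrites its input tensor; if that tensor was saved
        # for backward by an earlier op, autograd raises the version error.
        # inplace=False writes the result to a fresh tensor instead.
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        return self.relu(self.conv(x))
```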

abhiagwl4262 commented 1 year ago

How can I handle this if I don't want to downgrade my PyTorch version?

Sue-Tang-Up commented 6 months ago

I have the same question: how can I handle this if I don't want to downgrade my PyTorch version?

yd7am commented 6 months ago

In reply to the question above: change x = self.activate(x) to x = self.activate(x).clone()
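
Expanding on that one-liner: the sketch below is a toy module (not SOLO's actual code) showing why the .clone() helps. Cloning decouples the returned tensor from the output that ReluBackward0 saved, so later in-place operations on the feature map no longer invalidate the graph.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy conv + activation block standing in for the real module."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)
        self.activate = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        # .clone() returns a copy, so downstream in-place ops touch the copy
        # rather than the tensor ReluBackward0 saved for backward.
        return self.activate(x).clone()

block = Block()
out = block(torch.randn(1, 3, 8, 8, requires_grad=True))
out += 1.0            # in-place op that breaks backward() without the .clone()
out.sum().backward()  # works; drop the .clone() above to reproduce the error
```

The trade-off is a small amount of extra memory for the copied feature map, which is usually negligible compared to downgrading PyTorch.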