refine problem - Githubissues

nora1827 commented 1 year ago

Training the model with python main.py works well. But when I try to refine the model with python main.py --refine --lr 1e-5 --reload --previous_dir, it reports an error:

Traceback (most recent call last):
  File "main.py", line 225, in <module>
    loss = train(opt, actions, train_dataloader, model, optimizer_all, epoch)
  File "main.py", line 23, in train
    return step('train', opt, actions, train_loader, model, optimizer, epoch)
  File "main.py", line 94, in step
    loss.backward()
  File "D:\Anaconda\envs\pose\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "D:\Anaconda\envs\pose\lib\site-packages\torch\autograd\__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 1024]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I can't find the problem. I have already tried running with torch.autograd.set_detect_anomaly(True):, but I still can't find the root cause.
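
For readers who hit the same message, here is a minimal sketch of how this class of error can arise. It is not taken from this repository and does not identify which operation in main.py is responsible. The function name backward_status is hypothetical. The sketch assumes only that ReLU keeps its output for the backward pass, so editing that output in place changes its version counter and makes loss.backward() fail.

```python
import torch


def backward_status(edit_in_place: bool) -> str:
    """Run a tiny forward/backward pass and report whether backward() succeeds.

    Hypothetical helper for illustration only.
    """
    try:
        # Anomaly mode must wrap the forward pass too, so that autograd can
        # record where each operation was created.
        with torch.autograd.set_detect_anomaly(True):
            x = torch.randn(4, 8, requires_grad=True)
            h = torch.relu(x)  # ReLU keeps its output to compute the gradient
            if edit_in_place:
                h += 1  # in-place edit: h._version goes from 0 to 1
            else:
                h = h + 1  # out-of-place: the saved ReLU output is untouched
            h.sum().backward()
        return "ok"
    except RuntimeError as err:
        return str(err)


if __name__ == "__main__":
    print(backward_status(edit_in_place=False))
    print(backward_status(edit_in_place=True))
```

The out-of-place variant returns "ok", while the in-place variant returns the "modified by an inplace operation" message. In a real model the offending edit may sit far from the ReLU named in the error, which is why the hint points to the traceback that anomaly mode prints above it.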

Vegetebird commented 1 year ago

See https://github.com/Vegetebird/StridedTransformer-Pose3D/issues/19#issuecomment-1276159365

nora1827 commented 1 year ago

See #19 (comment)

Thank you! Using torch==1.7.1 avoids this problem, whereas changing "nn.ReLU(inplace=True)" did not.

Vegetebird / StridedTransformer-Pose3D

refine problem #27