ByChelsea / VAND-APRIL-GAN

[CVPR 2023 Workshop] VAND Challenge: 1st Place on Zero-shot AD and 4th Place on Few-shot AD

Question about gradients during training #27

Open genzhengmiaohong opened 7 months ago

genzhengmiaohong commented 7 months ago

Hello, after modifying the train.py file to train the network, I get the following error when the gradient of the final loss is computed: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation. Do you know how to fix this? My CUDA version is 12.2, so the versions in requirement.txt don't work for me; I first tried torch 2.1.0 and then switched to 2.2.1+cu118, and the error occurs with both. Looking forward to your reply.

tangyz213 commented 7 months ago

Have you solved it? I'm running into the same problem.

ByChelsea commented 7 months ago

Can you provide more detailed error information, please? I need to pinpoint the location of the error.

yangzc0214 commented 5 months ago

> Can you provide more detailed error information, please? I need to pinpoint the location of the error.

```
Traceback (most recent call last):
  File "train.py", line 177, in <module>
    train(args)
  File "train.py", line 140, in train
    loss.backward()
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\autograd\__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [8, 1369, 768]], which is output 0 of DivBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
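Following the hint at the end of the traceback, anomaly detection can be switched on before the training loop so the error also reports the forward-pass call site of the offending in-place op (a minimal sketch, not the repo's code):

```python
import torch

# Enable globally before running the training step. Autograd will then
# record a forward-pass traceback for each op and attach it to the
# backward-pass RuntimeError, pointing at the in-place operation.
# Note: this slows training noticeably, so use it only for debugging.
torch.autograd.set_detect_anomaly(True)
```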


My env: Windows 11, torch 2.2.2+cu121. In my environment, I changed line 122 of train.py to the following, and the error disappeared:

```python
patch_tokens[layer] = patch_tokens[layer] / patch_tokens[layer].norm(dim=-1, keepdim=True)
```
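For context, the failure mode this change avoids can be reproduced in a few lines (a minimal sketch with toy tensors, not the repo's code): an op like `exp()` saves its output for the backward pass, so normalizing that output in place invalidates it, while an out-of-place division creates a new tensor and leaves the saved value intact.

```python
import torch

# exp() saves its output for backward, so editing that output in place
# bumps its version counter and triggers the same
# "modified by an inplace operation" RuntimeError on backward().
x = torch.randn(2, 3, requires_grad=True)
y = x.exp()
y /= y.norm(dim=-1, keepdim=True)   # in-place division
try:
    y.sum().backward()
except RuntimeError as e:
    print("in-place version fails:", e)

# Out-of-place division (the shape of the fix above): a new tensor is
# assigned, the saved exp() output is untouched, and backward succeeds.
x = torch.randn(2, 3, requires_grad=True)
y = x.exp()
y = y / y.norm(dim=-1, keepdim=True)
y.sum().backward()                   # x.grad is populated
```

The same reasoning applies to `+=`, `*=`, and slice assignment on tensors that sit inside the autograd graph.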

oylz commented 5 months ago

fix it here