ByChelsea / VAND-APRIL-GAN

[CVPR 2023 Workshop] VAND Challenge: 1st Place on Zero-shot AD and 4th Place on Few-shot AD

Question about gradients during training #27

Open genzhengmiaohong opened 7 months ago

genzhengmiaohong commented 7 months ago

Hello, after modifying the train.py file to train the network, I get the following error when the gradient of the final loss is computed: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation. Do you know how to fix this? My CUDA version is 12.2, so the versions in requirement.txt don't work for me; I first tried torch 2.1.0 and then switched to 2.2.1+cu118, and the error occurs with both. Looking forward to your reply.

tangyz213 commented 7 months ago

Have you solved it? I'm running into the same problem.

ByChelsea commented 7 months ago

Can you provide more detailed error information, please? I need to pinpoint the location of the error.

yangzc0214 commented 5 months ago

> Can you provide more detailed error information, please? I need to pinpoint the location of the error.

```
Traceback (most recent call last):
  File "train.py", line 177, in <module>
    train(args)
  File "train.py", line 140, in train
    loss.backward()
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\autograd\__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [8, 1369, 768]], which is output 0 of DivBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
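Following the hint at the end of the traceback, anomaly detection can be switched on before the training loop so the error also reports the forward-pass call site of the offending in-place op (a minimal sketch, not the repo's code):

```python
import torch

# Enable globally before running the training step. Autograd will then
# record a forward-pass traceback for each op and attach it to the
# backward-pass RuntimeError, pointing at the in-place operation.
# Note: this slows training noticeably, so use it only for debugging.
torch.autograd.set_detect_anomaly(True)
```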


My env: Windows 11, torch 2.2.2+cu121. In my environment, I changed line 122 of train.py to the following, and the error disappeared:

```python
patch_tokens[layer] = patch_tokens[layer] / patch_tokens[layer].norm(dim=-1, keepdim=True)
```
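For context, the failure mode this change avoids can be reproduced in a few lines (a minimal sketch with toy tensors, not the repo's code): an op like `exp()` saves its output for the backward pass, so normalizing that output in place invalidates it, while an out-of-place division creates a new tensor and leaves the saved value intact.

```python
import torch

# exp() saves its output for backward, so editing that output in place
# bumps its version counter and triggers the same
# "modified by an inplace operation" RuntimeError on backward().
x = torch.randn(2, 3, requires_grad=True)
y = x.exp()
y /= y.norm(dim=-1, keepdim=True)   # in-place division
try:
    y.sum().backward()
except RuntimeError as e:
    print("in-place version fails:", e)

# Out-of-place division (the shape of the fix above): a new tensor is
# assigned, the saved exp() output is untouched, and backward succeeds.
x = torch.randn(2, 3, requires_grad=True)
y = x.exp()
y = y / y.norm(dim=-1, keepdim=True)
y.sum().backward()                   # x.grad is populated
```

The same reasoning applies to `+=`, `*=`, and slice assignment on tensors that sit inside the autograd graph.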

oylz commented 5 months ago

fix it here