Open Man1978-scd opened 1 year ago
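Already resolved 🤮
While debugging, I found that when the model is trained with mixed precision, it goes through the function in mmcv.runner.hooks.optimizer.py that copies gradients from the fp16 model to the fp32 weight copy: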
def copy_grads_to_fp32(self, fp16_net, fp32_weights):
    """Copy gradients from fp16 model to fp32 weight copy."""
    for fp32_param, fp16_param in zip(fp32_weights,
                                      fp16_net.parameters()):
        if fp16_param.grad is not None:
            if fp32_param.grad is None:
                fp32_param.grad = fp32_param.data.new(
                    fp32_param.size())
            fp32_param.grad.copy_(fp16_param.grad)
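The copy fails because the grad shapes of fp32_param and fp16_param do not match. With torch 1.5, fp16_net.parameters() sometimes returns the weight Tensor and sometimes the bias Tensor, so the paired shapes differ. I was left speechless. In the end, upgrading to torch 1.6 made everything run normally 🙀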
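One way to confirm this kind of mismatch is to compare the two parameter sequences pairwise before the copy. The sketch below is only an illustration: `find_shape_mismatches` is a hypothetical helper, not part of mmcv, and it assumes each parameter exposes a `.shape` attribute, as torch tensors do.

```python
def find_shape_mismatches(fp32_weights, fp16_params):
    """Return (index, fp32_shape, fp16_shape) for every pair whose shapes differ.

    Hypothetical debugging helper, not an mmcv API. Works on any objects
    that expose a `.shape` attribute (for example torch parameters).
    """
    fp32_list = list(fp32_weights)
    fp16_list = list(fp16_params)  # parameters() is a generator, so materialize it

    # zip() silently truncates to the shorter input, so check lengths explicitly.
    if len(fp32_list) != len(fp16_list):
        raise ValueError(
            f"parameter count differs: {len(fp32_list)} fp32 vs {len(fp16_list)} fp16"
        )

    mismatches = []
    for index, (fp32_param, fp16_param) in enumerate(zip(fp32_list, fp16_list)):
        fp32_shape = tuple(fp32_param.shape)
        fp16_shape = tuple(fp16_param.shape)
        if fp32_shape != fp16_shape:
            mismatches.append((index, fp32_shape, fp16_shape))
    return mismatches
```

Calling `find_shape_mismatches(fp32_weights, fp16_net.parameters())` just before the copy loop would list the indices at which a weight has been paired with a bias, if that is indeed what is happening.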
When I use torch 1.10.0 and run the training script
bash tools/dist_train.sh work_configs/tamper/tamper_convx_b_exp.py 2
I get the following error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [5, 512, 32, 32]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
The error is raised in runner/epoch_based_runner.py in the mmcv library. I assumed it was a version problem, so I downgraded torch to 1.5, which produced a different error: RuntimeError: The size of tensor a (2) must match the size of tensor b (128) at non-singleton dimension 3
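For the in-place error reported under torch 1.10.0, the following minimal sketch shows one common way that message arises: the output of a ReLU is modified in place before the backward pass. It is not taken from this repository, and whether the model here does the same thing is an assumption; `run_backward` is a hypothetical name used only for illustration.

```python
import torch


def run_backward(inplace: bool) -> str:
    """Return 'ok' if backward succeeds, otherwise the RuntimeError message."""
    x = torch.randn(2, 3, requires_grad=True)
    y = torch.relu(x)  # ReluBackward0 saves this output for the backward pass
    if inplace:
        y.add_(1.0)  # in-place edit bumps the saved tensor's version from 0 to 1
    else:
        y = y + 1.0  # out-of-place add leaves the saved ReLU output untouched
    try:
        y.sum().backward()
    except RuntimeError as err:
        return str(err)
    return "ok"


if __name__ == "__main__":
    print(run_backward(inplace=False))
    print(run_backward(inplace=True))
```

As the hint in the error message suggests, calling `torch.autograd.set_detect_anomaly(True)` before the training step should make the traceback point at the forward operation whose result was later overwritten, which helps locate the in-place call.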
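I don't know how to track this problem down 🤔. Could the author please provide the package versions that match the requirements, along with a usage tutorial? 🙀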
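When asking about version compatibility, it helps to attach the exact versions installed in the failing environment. The snippet below is a small sketch that does this with the standard library; the package names in `PACKAGES` are assumptions and should be adjusted to whatever this project actually depends on.

```python
from importlib import metadata

# Assumed distribution names; edit this list to match the project's requirements.
PACKAGES = ["torch", "torchvision", "mmcv", "mmcv-full"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```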