Junjue-Wang / Rank1-Ali-Tianchi-Real-World-Image-Forgery-Localization-Challenge

2022 Ali Tianchi Real-World Tampered Image Detection Challenge: champion solution (1/1149)

Runtime error #12

Open Man1978-scd opened 1 year ago

Man1978-scd commented 1 year ago

With torch 1.10.0, running the training script

bash tools/dist_train.sh work_configs/tamper/tamper_convx_b_exp.py 2

fails with the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [5, 512, 32, 32]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The error is raised in runner/epoch_based_runner.py of the mmcv library. I assumed it was a version problem, so I downgraded torch to 1.5, which produced a different error:

RuntimeError: The size of tensor a (2) must match the size of tensor b (128) at non-singleton dimension 3

I don't know how to track this problem down 🤔. Could the author please provide the package versions that go with the requirements, along with a usage tutorial? 🙀
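For version-dependent failures like this one, it can help to attach the exact installed package versions to the report. Below is a minimal sketch using only the Python standard library (Python 3.8+). The helper name `installed_versions` is hypothetical, and the package names in the usage section (`torch`, `mmcv-full`, `mmcv`, `mmsegmentation`) are assumptions about what this repository depends on, not a list confirmed by the author.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package names; adjust to what the environment actually uses.
    candidates = ["torch", "mmcv-full", "mmcv", "mmsegmentation"]
    for name, version in installed_versions(candidates).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output next to the traceback makes it easier to compare a failing environment with a working one.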

Man1978-scd commented 1 year ago

Resolved 🤮

While debugging, I found that when the model is optimized with mixed precision, it goes through the following copy function in mmcv.runner.hooks.optimizer.py:

def copy_grads_to_fp32(self, fp16_net, fp32_weights):
    """Copy gradients from fp16 model to fp32 weight copy."""
    for fp32_param, fp16_param in zip(fp32_weights,
                                      fp16_net.parameters()):
        if fp16_param.grad is not None:
            if fp32_param.grad is None:
                fp32_param.grad = fp32_param.data.new(
                    fp32_param.size())
            fp32_param.grad.copy_(fp16_param.grad)

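The grad shapes of fp32_param and fp16_param do not match, so the copy fails. With torch 1.5, fp16_net.parameters sometimes returned the weight tensor and sometimes the bias tensor, so the shapes did not line up. I'm speechless. In the end, upgrading to torch 1.6 made everything run normally 🙀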
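The mismatch described above can be located before the copy runs by comparing the two parameter sequences pair by pair. This is a minimal sketch, not part of mmcv: the helper `find_shape_mismatches` is hypothetical, and it only assumes each parameter exposes a `.shape` attribute (as torch tensors do). The demo uses plain stand-in objects so that it runs without torch installed.

```python
from types import SimpleNamespace


def find_shape_mismatches(fp32_weights, fp16_params):
    """Return (index, fp32_shape, fp16_shape) for every pair whose shapes differ."""
    fp32_list = list(fp32_weights)
    fp16_list = list(fp16_params)
    if len(fp32_list) != len(fp16_list):
        # zip() would silently truncate, so report the length difference instead.
        raise ValueError(
            f"parameter count differs: {len(fp32_list)} fp32 vs {len(fp16_list)} fp16"
        )
    mismatches = []
    for idx, (p32, p16) in enumerate(zip(fp32_list, fp16_list)):
        shape32, shape16 = tuple(p32.shape), tuple(p16.shape)
        if shape32 != shape16:
            mismatches.append((idx, shape32, shape16))
    return mismatches


# Stand-ins for a conv weight and its bias (illustrative shapes only).
weight = SimpleNamespace(shape=(128, 64, 3, 3))
bias = SimpleNamespace(shape=(128,))

# Same order on both sides: nothing to report.
print(find_shape_mismatches([weight, bias], [weight, bias]))
# Swapped order, like the weight/bias mix-up described above.
print(find_shape_mismatches([weight, bias], [bias, weight]))
```

In a real run, one could call it as `find_shape_mismatches(fp32_weights, fp16_net.parameters())` just before `copy_grads_to_fp32`; a non-empty result points at the first pair that is out of order.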