cleverhans-lab / cleverhans

An adversarial example library for constructing attacks, building defenses, and benchmarking both
MIT License

cleverhans.torch.utils.clip_eta causes a GPU memory leak due to its in-place operation #1230

Open Darius-H opened 2 years ago

Darius-H commented 2 years ago

cleverhans.torch.utils.clip_eta causes a GPU memory leak due to its in-place operation when the norm parameter is 2:

def clip_eta(eta, norm, eps):
    """
    PyTorch implementation of the clip_eta in utils_tf.

    :param eta: Tensor
    :param norm: np.inf, 1, or 2
    :param eps: float
    """
    if norm not in [np.inf, 1, 2]:
        raise ValueError("norm must be np.inf, 1, or 2.")

    avoid_zero_div = torch.tensor(1e-12, dtype=eta.dtype, device=eta.device)
    reduc_ind = list(range(1, len(eta.size())))
    if norm == np.inf:
        eta = torch.clamp(eta, -eps, eps)
    else:
        if norm == 1:
            raise NotImplementedError("L1 clip is not implemented.")
            norm = torch.max(
                avoid_zero_div, torch.sum(torch.abs(eta), dim=reduc_ind, keepdim=True)
            )
        elif norm == 2:
            norm = torch.sqrt(
                torch.max(
                    avoid_zero_div, torch.sum(eta ** 2, dim=reduc_ind, keepdim=True)
                )
            )
        factor = torch.min(
            torch.tensor(1.0, dtype=eta.dtype, device=eta.device), eps / norm
        )
        # eta *= factor
        # The line above is the original in-place multiplication and is what
        # leaks memory: when this function is called inside a for loop, the
        # allocated GPU memory keeps increasing until an out-of-memory error.
        # Returning a new tensor instead avoids the problem:
        return eta * factor

    return eta
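
For what it's worth, here is a minimal, self-contained sketch of the failure (toy shapes, and a local copy of the norm == 2 branch so it does not depend on which cleverhans version is installed). The eta ** 2 term saves eta for the backward pass, and the in-place eta *= factor then bumps eta's version counter, so the first backward through the perturbation raises:

import torch


def clip_eta_l2_inplace(eta, eps):
    # Local copy of the norm == 2 branch above, deliberately kept in-place.
    avoid_zero_div = torch.tensor(1e-12, dtype=eta.dtype, device=eta.device)
    reduc_ind = list(range(1, len(eta.size())))
    norm = torch.sqrt(
        torch.max(avoid_zero_div, torch.sum(eta ** 2, dim=reduc_ind, keepdim=True))
    )
    factor = torch.min(
        torch.tensor(1.0, dtype=eta.dtype, device=eta.device), eps / norm
    )
    eta *= factor  # in-place: bumps eta's version counter after eta ** 2 saved it
    return eta


x = torch.randn(8, 3, 32, 32)
delta = torch.zeros_like(x, requires_grad=True)
eta = delta * 1.0  # a non-leaf tensor in the autograd graph, as eta is during PGD
eta = clip_eta_l2_inplace(eta, eps=0.5)
loss = (x + eta).sum()
loss.backward()  # RuntimeError: ... has been modified by an inplace operation

Replacing the in-place multiply with return eta * factor, as above, lets the same script run cleanly.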
kylematoba commented 2 years ago

This is a pretty serious bug. For example, https://github.com/cleverhans-lab/cleverhans/blob/master/tutorials/torch/cifar10_tutorial.py does not work with norm == 2 and adversarial training turned on: the loss.backward() dies with something like

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [128, 3, 32, 32]], which is output 0 of MulBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
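
A minimal, hypothetical way to reproduce it outside the tutorial (a toy linear model and random data stand in for the tutorial's CNN and CIFAR-10 batch; the eps values are arbitrary):

import torch
import torch.nn as nn
from cleverhans.torch.attacks.projected_gradient_descent import (
    projected_gradient_descent,
)

# Toy stand-ins for the tutorial's network and a CIFAR-10 batch; only shapes matter.
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))

# Same adversarial-training step as the tutorial, but with norm 2 instead of
# np.inf, which routes clip_eta through the in-place branch shown above.
x_adv = projected_gradient_descent(net, x, 1.0, 0.1, 10, 2)  # eps, eps_iter, nb_iter, norm
loss = loss_fn(net(x_adv), y)
loss.backward()  # raises the RuntimeError above while clip_eta is still in-place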

I'd fix it except, you know, my six-month-old PR for another bug hasn't been looked at :unamused:.