The current code for this model accumulates gradients during decoding, which invariably causes the process to crash with an out-of-memory (OOM) error. Since the gradients are never used to create adversarial noise at this step, it would make sense to wrap this block in `with torch.no_grad():`.
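As a minimal sketch of the suggested fix (the model and decode loop here are stand-ins, not the actual code under review), wrapping the inference pass in `torch.no_grad()` prevents PyTorch from building a computation graph, so memory stays flat across decoding steps:

```python
import torch
import torch.nn as nn

# Stand-in for the real decoder; the actual model is assumed, not shown here.
model = nn.Linear(8, 8)
x = torch.randn(1, 8)

# Without no_grad(), each forward pass would retain a graph and grow memory.
# Inside no_grad(), no graph is recorded, so repeated steps cost nothing extra.
with torch.no_grad():
    for _ in range(5):  # placeholder for the decoding loop
        x = model(x)

# The output carries no gradient history and no backward pass is possible.
print(x.requires_grad)
```

If gradients are needed elsewhere in the same function (e.g. a later adversarial-noise step), only the decoding block needs to sit inside the context manager; code after the `with` block behaves normally.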