rsokl opened this issue 2 years ago
`mean = sum / N`, and thus `∂mean/∂input = (1/N) ∂sum/∂input`. Since PGD uses the sign of the gradient, we have `sign(∂mean/∂input) = sign((1/N) ∂sum/∂input) = sign(∂sum/∂input)`, so `mean` leads to the same result as `sum`.
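A minimal PyTorch sketch of this point (the toy loss and variable names are mine, not the library's): under `mean` vs. `sum` reduction the per-example gradients differ by a factor of `1/N`, but their signs agree, so a signed-gradient step like FGSM/PGD is unchanged.

```python
import torch

torch.manual_seed(0)
N = 8
w = torch.randn(3, 1)
x = torch.randn(N, 3, requires_grad=True)

def per_example_loss(x):
    # toy per-example loss, standing in for criterion(model(x + delta), y)
    return ((x @ w).squeeze(1)) ** 2

g_sum = torch.autograd.grad(per_example_loss(x).sum(), x)[0]
g_mean = torch.autograd.grad(per_example_loss(x).mean(), x)[0]

assert torch.allclose(g_mean, g_sum / N)        # magnitudes scaled by 1/N
assert torch.equal(g_mean.sign(), g_sum.sign())  # signs agree -> FGSM/PGD unaffected
```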
Right, as I stated, "obviously, this does not affect methods where the gradient is normalized." The point is that this happens not to affect methods like FGSM because of the signed gradient, but other methods would yield incorrect behavior.
Indeed. As long as `sign(grad)` is not in the update equation, this will trigger subtle bugs for people who want to implement custom algorithms.
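For illustration, here is a toy sketch of such a bug (not the library's code): with a plain, unsigned, unnormalized gradient-ascent step, a `mean`-reduced loss silently divides the effective per-datum step size by the batch size, so the same example receives a different update depending on how large its batch happens to be.

```python
import torch

def attack_step(x, delta, w, step_size, reduction):
    # one unsigned, unnormalized gradient-ascent step on a toy per-example loss
    delta = delta.clone().requires_grad_(True)
    per_example = ((x + delta) @ w).squeeze(1) ** 2
    loss = per_example.mean() if reduction == "mean" else per_example.sum()
    grad = torch.autograd.grad(loss, delta)[0]
    return (delta + step_size * grad).detach()

torch.manual_seed(0)
w = torch.randn(3, 1)
x4 = torch.randn(4, 3)
x128 = torch.cat([x4, torch.randn(124, 3)])  # same first four examples

d4 = attack_step(x4, torch.zeros_like(x4), w, 0.1, "mean")
d128 = attack_step(x128, torch.zeros_like(x128), w, 0.1, "mean")
# identical datum, identical loss, but the step shrank 32x in the bigger batch
assert torch.allclose(d4[0], 32 * d128[0])

s4 = attack_step(x4, torch.zeros_like(x4), w, 0.1, "sum")
s128 = attack_step(x128, torch.zeros_like(x128), w, 0.1, "sum")
assert torch.allclose(s4[0], s128[0])  # with "sum" the step is batch-size independent
```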
https://github.com/MadryLab/robustness/blob/a9541241defd9972e9334bfcdb804f6aefe24dc7/robustness/attacker.py#L195
Assuming that you are solving for per-datum perturbations, and not a broadcasted (or uniform) perturbation, the loss aggregation performed prior to backprop should be `sum`, not `mean`. Using `mean`, the gradient of each perturbation in the batch is scaled by the inverse batch size, whereas each perturbation's gradient should be independent of batch size. Obviously, this does not affect methods where the gradient is normalized.
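A hedged sketch of the pattern being advocated here (the function name, `model`, `x`, `y`, and `step_size` are placeholders of mine, not the library's API): aggregate the per-example losses with `reduction="sum"` before backprop, so that each perturbation's gradient depends only on its own example, whatever update rule follows.

```python
import torch
import torch.nn.functional as F

def perturb_step(model, x, y, delta, step_size):
    """One generic attack step whose per-datum gradient is batch-size independent."""
    delta = delta.clone().requires_grad_(True)
    # reduction="sum": delta[i]'s gradient depends only on example i,
    # so signed, normalized, and raw-gradient update rules all behave consistently
    loss = F.cross_entropy(model(x + delta), y, reduction="sum")
    grad = torch.autograd.grad(loss, delta)[0]
    return (delta + step_size * grad).detach()
```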