DeLightCMU / RSC

This is the official implementation of Self-Challenging Improves Cross-Domain Generalization, ECCV2020
BSD 2-Clause "Simplified" License

Why not use cross entropy for determining which features to mask? #11

Open BrianPugh opened 3 years ago

BrianPugh commented 3 years ago

In equation 1 in the paper, you compute the gradient of the element-wise product of the network output and the ground-truth one-hot label with respect to the input feature vector. This identifies the features that contribute most to the ground-truth class logit. For a softmax output, ideally we want the true-label logit to tend towards positive infinity while the other logits tend towards negative infinity.

So my question is: why not compute the more conventional cross-entropy loss here: https://github.com/DeLightCMU/RSC/blob/63726803bafd66184cac87d0db8de0c0d58889ba/models/resnet.py#L90

instead of just the sum of the true logits?
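
For concreteness, here is a minimal sketch (not the repository's code) of the two candidate signals being compared. The names `features`, `classifier`, and `labels` are hypothetical stand-ins for the pooled feature vector, the final FC layer, and the ground-truth labels inside RSC's forward pass:

```python
# Sketch only: contrasts (a) the gradient of the summed true-class logits,
# as in resnet.py#L90, with (b) the gradient of a cross-entropy loss.
import torch
import torch.nn.functional as F

batch, feat_dim, num_classes = 4, 512, 7
features = torch.randn(batch, feat_dim, requires_grad=True)  # pooled features (hypothetical)
classifier = torch.nn.Linear(feat_dim, num_classes)          # final FC layer (hypothetical)
labels = torch.randint(0, num_classes, (batch,))

logits = classifier(features)
one_hot = F.one_hot(labels, num_classes).float()

# (a) RSC-style signal: gradient of the summed true-class logits
#     (equation 1: d[(one_hot * logits).sum()] / d[features]).
true_logit_sum = (one_hot * logits).sum()
grad_logit = torch.autograd.grad(true_logit_sum, features, retain_graph=True)[0]

# (b) Alternative raised here: gradient of the cross-entropy loss.
ce_loss = F.cross_entropy(logits, labels)
grad_ce = torch.autograd.grad(ce_loss, features)[0]

# In either case, features with the largest gradient magnitude would be the ones masked.
print(grad_logit.abs().mean(), grad_ce.abs().mean())
```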

Justinhzy commented 3 years ago

Hi, logits encode how sensitive the output prediction is with respect to changes in each element of the feature map. Losses encode how difficult it is for the classifier to make a prediction using that element of the feature map.
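
One way to see the distinction (a back-of-the-envelope summary, not from the paper): let $z$ be the logits, $p = \mathrm{softmax}(z)$, and $y$ the one-hot label.

```latex
\[
\frac{\partial\, (y^\top z)}{\partial z} = y
\qquad\text{vs.}\qquad
\frac{\partial\, \mathrm{CE}(z, y)}{\partial z} = p - y .
\]
```

The logit-based signal weights the true class uniformly, while the cross-entropy gradient is scaled by how far the prediction $p$ currently is from $y$, i.e. by how difficult the sample is for the classifier.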