Closed ZerojumpLine closed 3 years ago
Hi Zeju, thanks for your questions. (1) Unlike VGG and AlexNet, ResNet applies average pooling before the fc layers, so I modified the implementation so that it can be plugged into different network architectures easily. (2) Interesting question. In my view, negative gradients decrease the ground truth's logit (confidence). Muting the bottom features with negative gradients would likely increase the ground truth's logit directly, which does not help the network challenge itself. In practice, I remember that muting the top features performs better than muting the bottom features on CIFAR and ImageNet.
Tell me if you have any further questions.
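To make the "mute the top features by gradient" idea concrete, here is a minimal NumPy sketch (not the repo's actual PyTorch code; the function name and shapes are my own illustration). It zeroes the fraction of features whose gradients are largest, per sample, which is the self-challenging step described above:

```python
import numpy as np

def mute_top_fraction(features, grads, fraction=1/3):
    """Zero out the features whose gradients fall in the top `fraction`.

    features, grads: arrays of shape (batch, num_features).
    Returns a copy of `features` with the highest-gradient entries muted.
    """
    muted = features.copy()
    k = int(features.shape[1] * fraction)
    # per-sample indices of the k largest gradients
    top_idx = np.argsort(grads, axis=1)[:, -k:]
    rows = np.arange(features.shape[0])[:, None]
    muted[rows, top_idx] = 0.0
    return muted

feats = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
grads = np.array([[0.5, -0.2, 0.9, 0.1, -0.8, 0.7]])
# top 1/3 -> the two features with the largest gradients (indices 2 and 5) are muted
print(mute_top_fraction(feats, grads))  # [[1. 2. 0. 4. 5. 0.]]
```

Muting the bottom fraction instead would just flip the slice to `[:, :k]`, which is exactly the variant you say performed worse in practice.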
Thanks for your reply. It is very helpful.
Can you please elaborate more on question (1)? I still don't understand why spatial-wise RSC was implemented as a sum of the activations (x_new) weighted by the per-channel mean gradients (spatial_mean = torch.sum(x_new * grad_channel_mean, 1)).
Based on the description in the paper ("global average pooling is applied along the channel dimension to the gradient tensor G to produce a weighting matrix wi of size [7 × 7]."), lines 99-103 should simply be: spatial_mean = torch.mean(grads_val.view(num_rois, num_channel, -1), dim=1).
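To show that the two formulations are not equivalent, here is a small NumPy sketch (my own paraphrase, not the repo code; variable names mirror resnet.py, but the shapes and data are made up for illustration):

```python
import numpy as np

num_rois, num_channel, h, w = 2, 4, 7, 7
rng = np.random.default_rng(0)
x_new = rng.standard_normal((num_rois, num_channel, h, w))      # activations
grads_val = rng.standard_normal((num_rois, num_channel, h, w))  # gradients

# Paper's description: channel-wise average of the *gradients*,
# giving a [7 x 7] spatial weighting matrix per sample.
paper_weights = grads_val.reshape(num_rois, num_channel, -1).mean(axis=1)
paper_weights = paper_weights.reshape(num_rois, h, w)

# Repo's implementation (as I read it): per-channel mean gradient,
# broadcast back and used to weight the *activations*.
grad_channel_mean = grads_val.mean(axis=(2, 3), keepdims=True)
repo_weights = (x_new * grad_channel_mean).sum(axis=1)

# The two maps generally differ: one depends only on the gradients,
# the other mixes activations with channel-averaged gradients.
print(np.allclose(paper_weights, repo_weights))
```

This prints False for generic inputs, which is why I suspect the implementation diverges from the paper's description.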
Thanks for sharing your code and interesting work. I have some concerns about the method details here.
In the newest version of the code, it seems that spatial-wise RSC is calculated based on the activations instead of the gradients. Is this intended? https://github.com/DeLightCMU/RSC/blob/79846abca815b416a71be4cc7c6e0b65923a2603/Domain_Generalization/models/resnet.py#L102
Another thing that confuses me is why we choose to mute the top 33% of features with the largest gradients. The bottom 33% of features (whose gradients are negative with large magnitude) are also strongly correlated with the prediction, just in a negative way.
Thanks, Zeju