Closed ZerojumpLine closed 3 years ago
Hi Zeju, thanks for your questions. (1) Unlike VGG and AlexNet, ResNet applies average pooling before the fc layers, so I modified the implementation so that it can be plugged into different network architectures easily. (2) Interesting question. In my view, negative gradients decrease the ground truth's logit (confidence). Muting the bottom features with negative gradients would likely increase the ground truth's logit directly, which does not help the network challenge itself. In practice, I remember that muting the top features performs better than muting the bottom features on CIFAR and ImageNet.
Tell me if you have any further questions.
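To make the "mute the top features by gradient" idea concrete, here is a minimal NumPy sketch (not the repo's actual PyTorch code; the function name and shapes are my own illustration). It zeroes the fraction of features whose gradients are largest, per sample, which is the self-challenging step described above:

```python
import numpy as np

def mute_top_fraction(features, grads, fraction=1/3):
    """Zero out the features whose gradients fall in the top `fraction`.

    features, grads: arrays of shape (batch, num_features).
    Returns a copy of `features` with the highest-gradient entries muted.
    """
    muted = features.copy()
    k = int(features.shape[1] * fraction)
    # per-sample indices of the k largest gradients
    top_idx = np.argsort(grads, axis=1)[:, -k:]
    rows = np.arange(features.shape[0])[:, None]
    muted[rows, top_idx] = 0.0
    return muted

feats = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
grads = np.array([[0.5, -0.2, 0.9, 0.1, -0.8, 0.7]])
# top 1/3 -> the two features with the largest gradients (indices 2 and 5) are muted
print(mute_top_fraction(feats, grads))  # [[1. 2. 0. 4. 5. 0.]]
```

Muting the bottom fraction instead would just flip the slice to `[:, :k]`, which is exactly the variant you say performed worse in practice.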
Thanks for your reply. It is very helpful.
Can you please elaborate more on question (1)? I still don't understand why spatial-wise RSC was implemented as a sum of the activations (x_new) weighted by the per-channel mean gradients (spatial_mean = torch.sum(x_new * grad_channel_mean, 1)).
Based on the description in the paper ("global average pooling is applied along the channel dimension to the gradient tensor G to produce a weighting matrix wi of size [7 × 7]."), lines 99-103 should simply be: spatial_mean = torch.mean(grads_val.view(num_rois, num_channel, -1), dim=1).
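To show that the two formulations are not equivalent, here is a small NumPy sketch (my own paraphrase, not the repo code; variable names mirror resnet.py, but the shapes and data are made up for illustration):

```python
import numpy as np

num_rois, num_channel, h, w = 2, 4, 7, 7
rng = np.random.default_rng(0)
x_new = rng.standard_normal((num_rois, num_channel, h, w))      # activations
grads_val = rng.standard_normal((num_rois, num_channel, h, w))  # gradients

# Paper's description: channel-wise average of the *gradients*,
# giving a [7 x 7] spatial weighting matrix per sample.
paper_weights = grads_val.reshape(num_rois, num_channel, -1).mean(axis=1)
paper_weights = paper_weights.reshape(num_rois, h, w)

# Repo's implementation (as I read it): per-channel mean gradient,
# broadcast back and used to weight the *activations*.
grad_channel_mean = grads_val.mean(axis=(2, 3), keepdims=True)
repo_weights = (x_new * grad_channel_mean).sum(axis=1)

# The two maps generally differ: one depends only on the gradients,
# the other mixes activations with channel-averaged gradients.
print(np.allclose(paper_weights, repo_weights))
```

This prints False for generic inputs, which is why I suspect the implementation diverges from the paper's description.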
Thanks for sharing your code and interesting work. I have some concerns about the method details here.
In the newest version of the code, it seems that spatial-wise RSC is calculated based on the activations instead of the gradients. Is this intended? https://github.com/DeLightCMU/RSC/blob/79846abca815b416a71be4cc7c6e0b65923a2603/Domain_Generalization/models/resnet.py#L102
Another thing that confuses me is why we choose to mute the top 33% of features with the largest gradients. The bottom 33% of features (whose gradients are negative with large magnitude) are also strongly correlated with the prediction, just in a negative way.
Thanks, Zeju