Hi, thanks for your great work. In your paper, you said that "We remove the ReLU activation at the last block of each ResNet", but I have read the code of "my_resnet.py" and found that you remove the last ReLU activation in every block of the ResNet.
ReLU is removed at the very last block because we need the full amplitude (negative and positive) for the cosine classifier (pretty much everyone does that).
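For intuition, here is a minimal sketch of a cosine classifier (not the repository's exact implementation; the class name and the fixed `scale` are illustrative). Because both the features and the class weights are L2-normalized, the sign of each feature dimension carries information, so clamping negatives with a trailing ReLU would throw away half the hypersphere:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Logits are scaled cosine similarities between L2-normalized
    features and L2-normalized class weight vectors."""

    def __init__(self, in_features: int, num_classes: int, scale: float = 10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features) * 0.01)
        self.scale = scale  # often a learnable parameter in practice

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Normalizing both operands makes the dot product a cosine similarity;
        # a ReLU on `features` would restrict them to the nonnegative orthant.
        return self.scale * F.linear(F.normalize(features, dim=1),
                                     F.normalize(self.weight, dim=1))
```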
ReLU is also removed at the end of the other blocks because it works better with the POD distillation losses. Some papers and blog articles have noted that removing these ReLUs doesn't change a ResNet's performance.
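A minimal sketch of what this looks like in a basic residual block (this is illustrative, not a copy of `my_resnet.py`; names and hyperparameters are placeholders). The only change versus a standard block is that no activation follows the residual addition, so the raw signed feature maps are what the distillation loss sees:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlockNoEndReLU(nn.Module):
    """Standard ResNet basic block, minus the ReLU that normally
    follows the residual addition."""

    def __init__(self, in_planes: int, planes: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = None
        if stride != 1 or in_planes != planes:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_planes, planes, 1, stride=stride, bias=False),
                nn.BatchNorm2d(planes),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = F.relu(self.bn1(self.conv1(x)))  # the inner ReLU is kept
        out = self.bn2(self.conv2(out))
        # No F.relu here: the block outputs signed pre-activation features,
        # which the distillation loss (and, for the last block, the cosine
        # classifier) consume directly.
        return out + identity
```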