ChongQingNoSubway / SelfReg-UNet

Code for the paper "SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation"
MIT License

About Feature Distillation #9

Closed. haozhi1817 closed this issue 1 week ago.

haozhi1817 commented 1 week ago

In my understanding, "shallow" and "deep" features have always meant the features of shallow and deep blocks. For example, in a UNet, the output of E1 is a shallow feature and the output of E3 is a deep feature. In your paper, however, you write: "Taking the output of E1 as an illustration, we calculate the feature similarity matrices in the channel dimension at both the shallow and the deep levels." Since each block consists of two layers, I assumed that "level" here meant the output of the first layer is the shallow feature and the output of the second layer is the deep feature. But later in the paper you also write: "Inspired by this, we employed the Lp norm for information distillation from shallow (top-half channel feature) to deep (bottom-half channel one), which guides the deeper features learned the useful context information", and in your code you indeed sort the channels of each block's output by their norm and take the top and bottom halves as the so-called shallow and deep features. So I am curious what your definition of shallow and deep actually is.

I also do not quite understand the following: for the output of any block, if we compute the MSE between one half of its features and the other half and use it as a loss, wouldn't the two halves of the block's output become identical in the limit? Yet you state that this loss is meant to alleviate feature redundancy, which seems to run directly counter to that outcome. Perhaps I missed something or misunderstood; I would appreciate your guidance.

ChongQingNoSubway commented 1 week ago

Here deep and shallow refer to the first and second halves of the channels in a given layer. For example, for a feature of shape (B x C x H x W) in layer 1 of E2, deep is [B x 0...C/2 x H x W] and shallow is [B x C/2...C x H x W]. This split is mainly used to measure whether the similarity of the features across channels is high. The L1/L2 norm is used to encourage sparsity of the features in order to address the feature redundancy. We believe this is not an elegant solution and it could instead be handled with methods such as a diversity loss or super tokens.
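For readers landing on this issue, here is a minimal PyTorch sketch of the mechanism described above: rank a block's output channels by norm, split them into a high-norm ("shallow"/salient) half and a low-norm ("deep"/redundant) half, and distill from the former to the latter with an MSE term. This is only an illustration of the idea, not the repository's exact implementation; the function name `channel_split_distill_loss` and the batch-averaged channel-norm ranking are assumptions.

```python
import torch
import torch.nn.functional as F


def channel_split_distill_loss(feat: torch.Tensor) -> torch.Tensor:
    """Distill from the high-norm ("shallow"/salient) half of the channels to
    the low-norm ("deep"/redundant) half of the same feature map.

    feat: (B, C, H, W) output of one encoder/decoder block; C is assumed even.
    Sketch only -- not the repository's exact implementation.
    """
    b, c, h, w = feat.shape
    # Per-channel L2 norm as a saliency score, averaged over the batch.
    channel_norm = feat.flatten(2).norm(p=2, dim=2).mean(dim=0)   # (C,)
    order = torch.argsort(channel_norm, descending=True)          # salient channels first
    salient = feat[:, order[: c // 2]]                            # teacher half
    redundant = feat[:, order[c // 2:]]                           # student half
    # MSE from the detached salient half to the redundant half, so only the
    # redundant channels receive a gradient from this term.
    return F.mse_loss(redundant, salient.detach())
```

In a training loop this term would be added to the segmentation loss with a small weight, e.g. `loss = seg_loss + lambda_fd * channel_split_distill_loss(block_out)`, where `lambda_fd` is a hypothetical weighting hyperparameter.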

haozhi1817 commented 1 week ago

Thank you for your reply; I now understand the definition of shallow and deep. I think that using MSE for this loss only addresses the "task-irrelevant" part of the sentence in your paper, "The resultant redundant features often come with task-irrelevant visual features, leading to performance degradation and unnecessary computation overhead." MSE pushes the low-norm features toward the high-norm features numerically, which is somewhat at odds with reducing redundancy; perhaps that is also part of what you mean by "not elegant". Thanks again for your reply.

ChongQingNoSubway commented 1 week ago

Thank you for your interest in the paper. Using the salient features to supervise the redundant features is not contradictory; this approach is fairly common in the model-pruning literature. You can refer to:

- Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems 29 (2016)
- Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., Tian, Q.: Variational convolutional neural network pruning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2780–2789 (2019)
- Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744 (2017)
- Li, L.: Self-regulated feature learning via teacher-free feature distillation. In: European Conference on Computer Vision, pp. 347–363. Springer (2022)

The reason I say this may not be very elegant is that it assumes the top half of the channels are the salient features, and taking C/2 is a hard-threshold choice. That is why I mentioned that many other methods could probably solve this better.
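As a rough illustration of the "diversity loss" alternative mentioned above (not something implemented in this repository), one common formulation penalizes pairwise cosine similarity between channels, pushing all channels apart rather than matching a redundant half to a salient half. The function name and the details below are assumptions.

```python
import torch
import torch.nn.functional as F


def channel_diversity_loss(feat: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity between channels of one feature map,
    encouraging channels to encode different information. Illustrative sketch only."""
    b, c, h, w = feat.shape
    flat = F.normalize(feat.flatten(2), dim=2)          # (B, C, H*W), unit-norm channels
    sim = torch.bmm(flat, flat.transpose(1, 2))         # (B, C, C) cosine-similarity matrix
    off_diag = sim - torch.eye(c, device=feat.device)   # drop the diagonal (self-similarity = 1)
    return off_diag.pow(2).mean()
```

Unlike the hard C/2 split, this formulation needs no assumption about which channels are salient, which is the design concern raised above.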

haozhi1817 commented 1 week ago

I have no background in pruning, so thank you very much for taking the time and effort to give these recommendations; I will read the related papers carefully. Thanks again, and best wishes.