LYJhere opened 6 days ago
Hello.
The operation in the code is, in practice, the same as the one described in the paper; the implementation differs for efficiency reasons. In effect, the two cases (s > 1/K and s < 1/K) are implemented with different weight vectors, as you note in your comment. The comparison itself is "represented" by the relu operators.
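To make that concrete, here is a minimal standalone sketch of how relu can encode the comparison. This is illustrative only, not the repo's actual `centering_modifiers`; the frequency vector `s` and the exact scaling are assumptions:

```python
import torch

def centering_modifiers_sketch(s: torch.Tensor):
    """Illustrative sketch only -- not the actual implementation.

    s: per-prototype assignment frequencies, shape (K,), summing to 1.
    relu(x) is zero for x <= 0, so each weight vector deviates from 1
    on only one side of s == 1/K; a weight of exactly 1 makes the
    corresponding affine map in `centering` the identity. The
    s > 1/K vs. s < 1/K branch is therefore encoded in the relus
    themselves, with no explicit comparison.
    """
    K = s.shape[-1]
    uniform = 1.0 / K
    # Differs from 1 only where s < 1/K: shrinks the cosine distance
    # (1 - t) for under-populated prototypes.
    lower_w = 1.0 - torch.relu(uniform - s) / uniform
    # Differs from 1 only where s > 1/K: shrinks the cosine similarity
    # for over-populated prototypes.
    higher_w = 1.0 - torch.relu(s - uniform) / (1.0 - uniform)
    return lower_w, higher_w
```

The scaling above is made up; the point is only that each weight vector is "active" on one side of the comparison and the identity on the other.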
I have some trouble following this code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def centering(self, teacher_output):
    labels_pre = teacher_output.argmax(-1)
    lower_w, higher_w = self.centering_modifiers()
    # Decreasing the cosine DISTANCE for small clusters
    teacher_output = 1 - (-teacher_output + 1) * lower_w
    # Decreasing the cosine SIMILARITY for large clusters
    teacher_output = (teacher_output + 1) * higher_w - 1
    teacher_output_argmax = teacher_output.argmax(-1)
    teacher_output_argmax_oh = F.one_hot(teacher_output_argmax, self.out_dim)
```
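To check my understanding, here is a tiny numeric experiment with just the two affine maps (my own sketch; the weight values are made up, not taken from `centering_modifiers`):

```python
import torch

t = torch.tensor([0.9, 0.1, -0.5])        # toy cosine similarities to K=3 prototypes
lower_w = torch.tensor([1.0, 0.5, 1.0])   # pretend prototype 1 belongs to a small cluster
higher_w = torch.tensor([0.8, 1.0, 1.0])  # pretend prototype 0 belongs to a large cluster

t = 1 - (-t + 1) * lower_w   # distance (1 - t) halved for prototype 1: 0.1 -> 0.55
t = (t + 1) * higher_w - 1   # similarity pulled toward -1 for prototype 0: 0.9 -> 0.52
print(t, t.argmax())         # tensor([ 0.5200,  0.5500, -0.5000]) tensor(1)
```

If I read it right, a weight of exactly 1 is the identity map, and in this toy case the argmax moves from prototype 0 to prototype 1 after centering.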
Does `lower_w` in the code correspond to s_k, and `higher_w` to 1/s_k?
I cannot find the comparison between s and 1/K that is depicted in your paper anywhere in the code.