Open jsrdcht opened 1 year ago
I noticed that the teacher model's output first passes through a norm, self.feature_model.norm(x_tgt), and is then passed through self.ln_tgt(x_tgt). In effect, the teacher model's output goes through Layer Norm twice. I don't quite understand why.
You mentioned that the whitening operation is non-parametric, but it seems to be implemented with the norm operation from the original paper, which is parametric.
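To make the question concrete, here is a minimal pure-Python sketch of the two steps as I understand them. It is not the repository's code: the `layer_norm` helper, the `gamma`/`beta` values, and the input vector are made up for illustration, and it assumes that self.ln_tgt has no learnable scale or shift (in PyTorch that would correspond to `nn.LayerNorm(dim, elementwise_affine=False)`), which is exactly the part that needs confirming.

```python
import math


def layer_norm(x, weight=None, bias=None, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance.

    If weight and bias are None, the operation has no learnable parameters.
    Otherwise a per-feature scale and shift is applied afterwards.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if weight is None:
        return normed
    return [w * n + b for w, n, b in zip(weight, normed, bias)]


def mean_and_var(x):
    """Return the mean and (biased) variance of a feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return mean, var


# Hypothetical teacher features and hypothetical learned affine parameters.
x_tgt = [0.5, -1.0, 2.0, 3.5]
gamma = [2.0, 0.5, 1.0, 3.0]
beta = [0.1, -0.2, 0.0, 0.3]

# Step 1: stands in for self.feature_model.norm (parametric: scale and shift).
teacher_out = layer_norm(x_tgt, gamma, beta)

# Step 2: stands in for self.ln_tgt, ASSUMED here to be affine-free.
target = layer_norm(teacher_out)

teacher_mean, teacher_var = mean_and_var(teacher_out)
target_mean, target_var = mean_and_var(target)
```

Under these assumptions the first norm's output is no longer zero-mean and unit-variance, because the learned scale and shift move it away from that, so a second affine-free norm would not be a no-op. If self.ln_tgt does carry learnable parameters, this explanation does not apply and the original question stands.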