Open Akella17 opened 5 years ago
@fmcp I was going through your paper and am a little confused by the idea of the distillation loss. More specifically, I am assuming that the ground-truth probability distribution of each sample is a one-hot vector. This means that the distillation loss effectively contains only one of the classes, since all the other classes have p_i = 0. So I don't see the point of raising the probabilities to the power 1/T, since it effectively only rescales the loss term linearly by a factor of 1/T (loss = -log(q_i^{1/T}) = -(1/T) * log(q_i), where i is the index of the ground-truth class).
Also, regarding q_i: I am assuming that by "logits" you mean the softmax-normalized probabilities over the total set of classes (old + new).
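To make my reading concrete, here is a tiny NumPy sketch of how I currently understand the temperature-scaled distillation term (raise the probabilities to 1/T, renormalize, then take the cross-entropy). The names `p`, `q`, `T` and the example values are mine, not from the paper, so please correct me if this is not what you do:

```python
import numpy as np

T = 2.0  # temperature (my own choice, just for illustration)

# q: probabilities produced by the network over all classes (old + new)
# p: what I assume to be the one-hot ground-truth distribution
q = np.array([0.10, 0.70, 0.15, 0.05])
p = np.array([0.0, 1.0, 0.0, 0.0])  # ground-truth class at index 1

# Raise both distributions to 1/T and renormalize (my reading of pdist_i / qdist_i)
p_dist = p ** (1.0 / T) / np.sum(p ** (1.0 / T))
q_dist = q ** (1.0 / T) / np.sum(q ** (1.0 / T))

# Distillation loss = cross-entropy between the temperature-scaled distributions
eps = 1e-12
loss = -np.sum(p_dist * np.log(q_dist + eps))

# Because p is one-hot, the sum collapses to the single ground-truth term
i = np.argmax(p)
print(loss, -np.log(q_dist[i] + eps))  # both prints give the same value
```

If this sketch matches your formulation, then my question above is exactly about why the 1/T exponent is useful when only that single term survives.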
I am also confused about this. Do you understand the meaning of pdist_i in the paper?