Open Akella17 opened 5 years ago
@fmcp I was going through your paper and am a little confused by the idea of the distillation loss. More specifically, I am assuming that the ground-truth probability distribution of each sample is a one-hot vector. This means that the distillation loss effectively contains only one of the classes, since all the other classes have p_i = 0. So I don't see the point of raising the probabilities to the power 1/T, since it effectively only rescales the loss term linearly by a factor of 1/T (loss = -log(q_i^{1/T}) = -(1/T) * log(q_i), where i is the index of the ground-truth class).
Also, regarding q_i: I am assuming that by "logits" you mean the softmax-normalized probabilities over the total set of classes (old + new).
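To make my reading concrete, here is a tiny NumPy sketch of how I currently understand the temperature-scaled distillation term (raise the probabilities to 1/T, renormalize, then take the cross-entropy). The names `p`, `q`, `T` and the example values are mine, not from the paper, so please correct me if this is not what you do:

```python
import numpy as np

T = 2.0  # temperature (my own choice, just for illustration)

# q: probabilities produced by the network over all classes (old + new)
# p: what I assume to be the one-hot ground-truth distribution
q = np.array([0.10, 0.70, 0.15, 0.05])
p = np.array([0.0, 1.0, 0.0, 0.0])  # ground-truth class at index 1

# Raise both distributions to 1/T and renormalize (my reading of pdist_i / qdist_i)
p_dist = p ** (1.0 / T) / np.sum(p ** (1.0 / T))
q_dist = q ** (1.0 / T) / np.sum(q ** (1.0 / T))

# Distillation loss = cross-entropy between the temperature-scaled distributions
eps = 1e-12
loss = -np.sum(p_dist * np.log(q_dist + eps))

# Because p is one-hot, the sum collapses to the single ground-truth term
i = np.argmax(p)
print(loss, -np.log(q_dist[i] + eps))  # both prints give the same value
```

If this sketch matches your formulation, then my question above is exactly about why the 1/T exponent is useful when only that single term survives.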
I am also confused about this. Do you understand the meaning of pdist_i in the paper?