Dear Author,
I'm confused about one step of the training process: the detector loss is computed from "dense_scores" produced by the teacher model and "logits" produced by the student model. However, "dense_scores" is the output of a softmax function, while "logits" is not. Why not use "prob_full", which is computed in the local_head with a softmax function, on the student side? That seems more consistent. Is there a specific reason for the current choice? Looking forward to your reply.
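To make the question concrete, here is a minimal sketch of my guess at what might be going on. It is not taken from the repository: the function names and numeric values are hypothetical, and I am assuming the detector loss is a cross-entropy with soft targets. Under that assumption, a loss that takes raw student logits and applies log-softmax internally gives the same value as one that takes already-softmaxed student probabilities, so mixing teacher probabilities with student logits would be intentional rather than a mismatch.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a flat list of scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    # log(softmax(x)) computed via the log-sum-exp trick.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def soft_target_cross_entropy(teacher_probs, student_logits):
    # H(p, q) = -sum_i p_i * log q_i, with q = softmax(student_logits).
    # The softmax on the student side happens inside the loss.
    log_q = log_softmax(student_logits)
    return -sum(p * lq for p, lq in zip(teacher_probs, log_q))

# Hypothetical values, for illustration only.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.2, -0.3]

# Teacher side: probabilities after softmax (the role "dense_scores" plays).
dense_scores = softmax(teacher_logits)

# Variant A: loss fed with raw student logits.
loss_from_logits = soft_target_cross_entropy(dense_scores, student_logits)

# Variant B: loss fed with student probabilities (the role "prob_full" plays).
prob_full = softmax(student_logits)
loss_from_probs = -sum(p * math.log(q) for p, q in zip(dense_scores, prob_full))

print(loss_from_logits, loss_from_probs)
```

If the actual loss is of this form, the two variants differ only in numerical stability, with the logits version being the safer one. If it is a different loss, for example an MSE directly between "dense_scores" and "logits", then my original question stands.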