About the training process

Dear Author, I'm confused about how the detector loss is computed during training: it uses "dense_scores" from the teacher model and "logits" from the student model. However, "dense_scores" is obtained after a softmax function, while "logits" is not. Why not use "prob_full", which is computed in the local_head with a softmax function and seems like the more reasonable counterpart? Is there a specific reason? Looking forward to your reply.
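To make the question concrete, here is a minimal NumPy sketch of the two options. It is not the repository's code: the function names, shapes, and toy values are assumptions for illustration only. It compares a cross-entropy computed from teacher probabilities and raw student logits (with the softmax applied inside the loss) against one computed from teacher probabilities and already-softmaxed student probabilities.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax: subtract the max before exponentiating.
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

def softmax(logits, axis=-1):
    return np.exp(log_softmax(logits, axis=axis))

def ce_from_logits(teacher_probs, student_logits):
    # Soft-target cross-entropy; the softmax is applied inside the loss.
    return -np.sum(teacher_probs * log_softmax(student_logits), axis=-1)

def ce_from_probs(teacher_probs, student_probs):
    # Same loss, but the student output has already been through a softmax.
    return -np.sum(teacher_probs * np.log(student_probs), axis=-1)

# Toy values (hypothetical): 2 cells, 3 classes each.
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
student_logits = np.array([[1.5, 0.0, -0.5], [0.0, 1.0, 2.0]])

teacher_probs = softmax(teacher_logits)   # plays the role of "dense_scores"
student_probs = softmax(student_logits)   # plays the role of "prob_full"

loss_a = ce_from_logits(teacher_probs, student_logits)
loss_b = ce_from_probs(teacher_probs, student_probs)
print(np.allclose(loss_a, loss_b))
```

If the loss function applies the softmax to its `logits` argument internally, as TensorFlow's `tf.nn.softmax_cross_entropy_with_logits` does, then pairing post-softmax targets with pre-softmax predictions would be mathematically equivalent to using "prob_full", and typically more stable numerically. Whether that is the case for the detector loss here is part of what I am asking.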

ethz-asl / hfnet

About the training process #55