PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Why do we skip cases where the student and teacher operate on the same view? And if they operate on different views, why should we expect them to produce similar outputs when computing the cross-entropy loss? #267
total_loss = 0
n_loss_terms = 0
for iq, q in enumerate(teacher_out):
    for v in range(len(student_out)):
        if v == iq:
            # we skip cases where student and teacher operate on the same view
            continue
        loss = torch.sum(-q * F.log_softmax(student_out[v], dim=-1), dim=-1)
        total_loss += loss.mean()
        n_loss_terms += 1
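For context, here is a self-contained sketch of roughly how `teacher_out` and `student_out` are prepared just before this loop in `DINOLoss.forward`. The batch size, output dimension, number of crops, and temperatures below are illustrative stand-ins, and the random logits replace the actual network outputs:

import torch
import torch.nn.functional as F

batch_size, out_dim = 4, 8        # illustrative sizes, not the repo defaults
n_global, n_crops = 2, 6          # 2 global views for the teacher, 6 total views for the student
student_temp, teacher_temp = 0.1, 0.04
center = torch.zeros(1, out_dim)  # running center used in DINO to avoid collapse

# In the repo these come from the student/teacher networks on the augmented views;
# random logits here keep the snippet runnable on its own.
student_output = torch.randn(n_crops * batch_size, out_dim)
teacher_output = torch.randn(n_global * batch_size, out_dim)

# Student: temperature-scaled logits, split into one chunk per view.
student_out = (student_output / student_temp).chunk(n_crops)
# Teacher: centered and sharpened probabilities, detached so no gradient flows to the teacher.
teacher_out = F.softmax((teacher_output - center) / teacher_temp, dim=-1).detach().chunk(n_global)

total_loss, n_loss_terms = 0, 0
for iq, q in enumerate(teacher_out):
    for v in range(len(student_out)):
        if v == iq:
            continue  # skip same-view pairs
        loss = torch.sum(-q * F.log_softmax(student_out[v], dim=-1), dim=-1)
        total_loss += loss.mean()
        n_loss_terms += 1
total_loss /= n_loss_terms  # here 2 * 6 - 2 = 10 cross-view terms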
I am not an expert, but my intuition is that feeding the same view to both networks would lead to a very small loss, and hence an insignificant training signal, so computing it would be wasted resources.
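To make that intuition concrete: if the student's prediction on a view exactly matched the teacher's, the same-view cross-entropy term would reduce to the entropy of q, and a sharpened, near-one-hot teacher distribution makes that almost zero; the network could shrink it by sharpening alone, without learning any invariance across views. A minimal numeric illustration (the logits and temperature below are made up):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])
q = F.softmax(logits / 0.04, dim=-1)                 # sharp, near-one-hot teacher distribution
same_view_ce = torch.sum(-q * torch.log(q), dim=-1)  # CE(q, q) equals the entropy of q
print(same_view_ce)  # ~0: this term is nearly free to minimize, so it carries no cross-view signal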