facebookresearch / dino

PyTorch code for training Vision Transformers with the self-supervised learning method DINO
Apache License 2.0

Why do we skip cases where the student and teacher operate on the same view? If they are operating on different views, why should they produce similar results to calculate the cross-entropy loss? #267


jinghere11 commented 9 months ago
    total_loss = 0
    n_loss_terms = 0
    # teacher_out holds the teacher's outputs on the 2 global crops;
    # student_out holds the student's outputs on all crops (global + local).
    for iq, q in enumerate(teacher_out):
        for v in range(len(student_out)):
            if v == iq:
                # we skip cases where student and teacher operate on the same view
                continue
            # cross-entropy between the (detached) teacher distribution q and
            # the student's log-probabilities on a different view
            loss = torch.sum(-q * F.log_softmax(student_out[v], dim=-1), dim=-1)
            total_loss += loss.mean()
            n_loss_terms += 1
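
For context, in main_dino.py the teacher only runs on the two global crops, while the student runs on all crops, and both outputs are chunked per view before this loop, so iq indexes teacher views and v indexes student views. Below is a minimal, self-contained sketch of that pairing (dummy tensors stand in for network outputs; the sizes and temperatures are illustrative assumptions, and the teacher centering step from the repo is omitted):

    import torch
    import torch.nn.functional as F

    # Illustrative sizes (assumptions, not the repo's defaults)
    batch, out_dim = 4, 16
    n_global, n_local = 2, 4                  # teacher sees only the global crops
    student_temp, teacher_temp = 0.1, 0.04    # sharper (lower-temp) teacher targets

    # Dummy logits standing in for network outputs on each crop
    student_out = [torch.randn(batch, out_dim) for _ in range(n_global + n_local)]
    teacher_out = [F.softmax(torch.randn(batch, out_dim) / teacher_temp, dim=-1).detach()
                   for _ in range(n_global)]

    total_loss, n_loss_terms = 0.0, 0
    for iq, q in enumerate(teacher_out):
        for v in range(len(student_out)):
            if v == iq:
                continue  # drop the (teacher view i, student view i) pair
            loss = torch.sum(-q * F.log_softmax(student_out[v] / student_temp, dim=-1), dim=-1)
            total_loss += loss.mean()
            n_loss_terms += 1
    total_loss /= n_loss_terms

With 2 teacher views and 6 student views this yields 2 × 6 − 2 = 10 loss terms: every teacher view is paired with every student view except the identical one.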
fbliman commented 7 months ago

I am not an expert, but my intuition is that feeding the same view to both networks would lead to a very small loss and hence insignificant training, so it would be wasted resources.

But that's only a guess.
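
That intuition can be checked numerically. The cross-entropy H(q, p) is minimized exactly when p = q, where it equals the entropy H(q), and at that point the gradient through the student's logits is zero. So once the student reproduces the teacher on the identical view, the same-view term would sit at a floor and contribute essentially nothing, whereas a cross-view pairing still forces the student to predict the teacher's output from a different crop. A quick standalone check (not code from the repo):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    logits = torch.randn(1, 8)
    q = F.softmax(logits, dim=-1)                 # teacher distribution on view i

    # Same view, student already matching the teacher: loss equals H(q)
    same_view = torch.sum(-q * F.log_softmax(logits, dim=-1), dim=-1)
    entropy = torch.sum(-q * torch.log(q), dim=-1)
    print(same_view.item(), entropy.item())       # identical values

    # A different view gives different logits, so the loss sits above that floor
    cross_view = torch.sum(-q * F.log_softmax(torch.randn(1, 8), dim=-1), dim=-1)
    print(cross_view.item())                      # > H(q) by Gibbs' inequality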