Open Oopslulu opened 11 months ago
If your batch size is 1, you can get a NaN loss because of the KoLeo loss used in DINO training. It is computed from the smallest Euclidean distance between samples inside the batch, which encourages the features to spread uniformly within the batch — and with a single sample, a sample's only "nearest neighbor" is itself, at distance zero.
If you can't increase the batch size, you can try setting `koleo_loss_weight=0`, or adding a guard to the code where the KoLeo loss is calculated.
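To see why a batch of one blows up, here is a simplified sketch of a KoLeo-style loss (not the exact dinov2 implementation — the function name, `eps` parameter, and structure are illustrative). The loss is the negative log of each sample's nearest-neighbor distance; with batch size 1, that distance is 0 and the log diverges:

```python
import torch
import torch.nn.functional as F

def koleo_loss(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KoLeo-style regularizer: -log of nearest-neighbor distance (sketch)."""
    # Project features onto the unit sphere
    x = F.normalize(features, p=2, dim=-1)
    # Pairwise dot products; mask the diagonal so a sample
    # is not matched with itself (when another sample exists)
    dots = x @ x.t()
    dots.fill_diagonal_(-1.0)
    # Largest dot product = smallest Euclidean distance on the sphere
    nn_idx = dots.argmax(dim=1)
    nn_dist = (x - x[nn_idx]).norm(dim=1)
    # With batch size 1 the argmax still lands on the sample itself,
    # so nn_dist == 0 and the log diverges (inf forward, NaN gradients)
    return -torch.log(nn_dist + eps).mean()

batch = torch.randn(4, 16)
print(koleo_loss(batch))                      # finite for batch size > 1
print(koleo_loss(torch.randn(1, 16), eps=0))  # diverges for batch size 1
```

A small `eps` keeps the forward pass finite, but with batch size 1 the loss is still a meaningless large constant, so zeroing the weight or skipping the term entirely is the cleaner fix.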
I deleted the line `model.prepare_for_distributed_training()` in dinov2/train/train.py, and then my loss becomes NaN after only 1 iteration of training.
I don't know why — I only changed an operation related to distributed training 😣
Please help me if you know the reason, thanks a lot!!!