facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.

Loss becomes NaN without FSDP #314

Open Oopslulu opened 11 months ago

Oopslulu commented 11 months ago

I deleted the call `model.prepare_for_distributed_training()` in dinov2/train/train.py,

and now my loss becomes NaN after only one training iteration.

I don't know why; I only changed an operation related to distributed training 😣

Please help me if you know the reason, thanks a lot!

evyatar-bur commented 10 months ago

If your batch size is 1, you might get a NaN loss because of the KoLeo loss used in DINO training. That loss is computed from the smallest Euclidean distance between samples within the batch, to encourage a uniform spread of the features; with a single sample, the nearest-neighbor distance degenerates to zero, so the log term blows up.
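
To see why, here is a minimal sketch of a KoLeo-style loss (not the repository's exact implementation; the function name and `eps` default are assumptions for illustration). It penalizes small nearest-neighbor distances, and with a batch of one the "nearest neighbor" is the sample itself:

```python
import torch
import torch.nn.functional as F

def koleo_loss_sketch(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Normalize features, then penalize small nearest-neighbor distances
    # to spread the samples uniformly over the unit sphere.
    x = F.normalize(features, dim=-1)
    dots = x @ x.t()             # pairwise cosine similarities
    dots.fill_diagonal_(-2.0)    # push self-similarity below any real pair
    nn_index = dots.argmax(dim=1)  # nearest neighbor = most similar other sample
    nn_dist = (x - x[nn_index]).norm(dim=1)
    return -torch.log(nn_dist + eps).mean()

# With batch size 1 the only candidate neighbor is the sample itself,
# so nn_dist is 0: without the eps guard this is -log(0) = inf, and
# with it the loss is a huge constant that can destabilize training.
print(koleo_loss_sketch(torch.randn(1, 16)))  # degenerate
print(koleo_loss_sketch(torch.randn(8, 16)))  # well-behaved
```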

If you can't increase the batch size, you can try setting koleo_loss_weight=0, or add a guard to the code where the KoLeo loss is calculated.
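
If you go the guard route, the idea is just to skip the term when the batch is degenerate. A minimal sketch (the wrapper name and signature here are hypothetical, not dinov2's API):

```python
import torch

def guarded_koleo(features: torch.Tensor, koleo_fn, weight: float) -> torch.Tensor:
    # Skip the KoLeo term when the batch has fewer than two samples:
    # the nearest-neighbor distance is then zero and -log(0) diverges.
    if weight == 0.0 or features.shape[0] < 2:
        return features.new_zeros(())
    return weight * koleo_fn(features)
```

Setting koleo_loss_weight to 0 in the training config disables the term entirely and achieves the same effect without touching the loss code.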