facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.

Ideal batch size for finetuning #173

Open surajyakoa opened 1 year ago

surajyakoa commented 1 year ago

Hello!

I am trying to finetune either the vit_s or vit_b models on my dataset. I have tried training only the DINO head, both the DINO and iBOT heads, and keeping the whole backbone frozen or unfreezing a few final layers. I have a large dataset (> 1,000,000 images), but I'm trying to train on a T4, meaning my batch size is either 4, 8, or 16. I always seem to see the nearest-neighbor performance on my dataset slowly decline as I train, and further analysis makes me believe all the data points are collapsing.
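
Roughly, the kind of check I mean (an illustrative sketch, not the exact analysis):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def mean_pairwise_cosine(embeddings: torch.Tensor) -> float:
        # If this approaches 1, the embeddings have collapsed to (almost) a single point.
        x = F.normalize(embeddings, dim=-1)                           # (n_samples, dim)
        sim = x @ x.t()                                               # pairwise cosine similarities
        mask = ~torch.eye(len(x), dtype=torch.bool, device=x.device)
        return sim[mask].mean().item()                                # average off-diagonal similarity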

I'm wondering if this is due to my small batch size. I don't have a strong intuition yet for the centering, so I'm wondering if my small batch size basically guarantees that my data points will collapse (I am used to using a contrastive loss with a queue of negative examples, which avoids this issue). Any recommendation on what minimum batch size I would need for this to work well? And if it is large, are there any potential workarounds that would make it possible to train on a single GPU?

surajyakoa commented 1 year ago

Do any of the authors have an intuition of what a reasonable batch size needs to be for the centering or the SK / KoLeo techniques to work correctly?

qasfb commented 1 year ago

What kind of fine-tuning are you trying to do? Are you trying to perform dinov2-training with your own unlabeled data, but using the distilled vit-s/b as initializations? Because I expect it wouldn't be useful, as these networks have not done masked-image-modeling during their distillation phase.

About the batch size I would say your guess is as good as mine; we haven't explored batch sizes that small in our work.

surajyakoa commented 1 year ago

Hello,

Thanks for your response. Yes, I'm hoping to perform dinov2-training with my own unlabeled data, but using vit_s/vit_b as initialization. That is curious. Is there a reason why the masked-image-modeling performance would not transfer over from the distillation objective? What if I only finetuned using the dino objective and left out the MIM?

Gotcha. Have you noticed, though, a proclivity for the embeddings to collapse if you use too small a batch size? Or is this something you have not run into?

Thanks!

GZ-YourZY commented 10 months ago

May I ask if you've made some progress with your fine-tuning?

JunzheJosephZhu commented 8 months ago

In the paper it says that the iBOT loss is used for distillation. If I understand correctly, the distillation iBOT loss is computed by predicting a distribution with the teacher given the unmasked input, predicting a distribution with the student given the unmasked input, and minimizing the cross-entropy between the two. Could you confirm this? @qasfb

JunzheJosephZhu commented 8 months ago

What if I only finetuned using the dino objective and left out the MIM?

Same question here @TimDarcet @qasfb @patricklabatut

TimDarcet commented 8 months ago

In the paper it says that the iBOT loss is used for distillation. If I understand correctly, the distillation iBOT loss is computed by predicting a distribution with the teacher given the unmasked input, predicting a distribution with the student given the unmasked input, and minimizing the cross-entropy between the two. Could you confirm this? @qasfb

Correct. During distillation, the iBOT loss is applied as usual, except there is no masking. So the loss is applied on the patch tokens.
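
Roughly, it looks like this (an illustrative sketch; tensor shapes, centering and temperatures are simplified, this is not the repo's exact iBOT head code):

    import torch
    import torch.nn.functional as F

    def patch_distillation_loss(student_patch_logits, teacher_patch_logits,
                                teacher_center, student_temp=0.1, teacher_temp=0.07):
        # Both logit tensors: (batch, num_patches, num_prototypes), computed on the *unmasked* image.
        # teacher_center: running center, broadcastable to the teacher logits.
        with torch.no_grad():
            # Teacher targets: centered and sharpened softmax, no gradient.
            targets = F.softmax((teacher_patch_logits - teacher_center) / teacher_temp, dim=-1)
        # Student predictions at a higher temperature.
        log_preds = F.log_softmax(student_patch_logits / student_temp, dim=-1)
        # Cross-entropy averaged over every patch token (no mask is applied during distillation).
        return -(targets * log_preds).sum(dim=-1).mean()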

What if I only finetuned using the dino objective and left out the MIM?

No idea. Our intuition is that usually MIM improves all scores, but we can't be sure for this specific setup.

To go back to the original question:

Any recommendation on what minimum batch size I would need for this to work well?

We haven't experimented with total_batch_size < 2048, so we don't know.

And if it is large, are there any potential workarounds that would make it possible to train on a single GPU?

The only batch-size-dependent parts are SK and KoLeo. Both add a contrastive element to the loss, similar to SimCLR, so they might behave badly at small BS.

  • SK: you can try replacing SK with the original DINO-style centering, which has no intra-batch dependency (train.centering=centering)
  • KoLeo: you can try using a lower koleo_loss_weight, to account for the higher loss values at low BS, or simply removing it. The training should work well.

Note that a lower BS also means different optimal hyperparameters.

Remember you're in uncharted territory: you might need to tweak a lot before it works, and explore very wide ranges of hyperparameters. I'm curious, though! If you get results that are okay to share, I'd love to know if and how those batch sizes can work.
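
For intuition, here is a rough sketch of a KoLeo-style term (illustrative, not the repo's exact KoLeoLoss); the nearest neighbour is taken within the batch, which is where the batch-size dependence comes from:

    import torch
    import torch.nn.functional as F

    def koleo_term(features, eps=1e-8):
        # Each sample is pushed away from its nearest neighbour *within the batch*,
        # so both the value and the gradients depend directly on the batch size.
        x = F.normalize(features, dim=-1)              # (batch, dim)
        with torch.no_grad():
            sim = x @ x.t()                            # pairwise cosine similarities
            sim.fill_diagonal_(-2.0)                   # exclude self-similarity
            nn_idx = sim.argmax(dim=1)                 # nearest neighbour of each sample
        nn_dist = (x - x[nn_idx]).norm(dim=-1)         # distance to that neighbour
        return -torch.log(nn_dist + eps).mean()        # penalize small nearest-neighbour distances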

JunzheJosephZhu commented 8 months ago

Thanks for the response, Timothée! Regarding the KoLeo loss, in the repo I can't find a sync/reduce op before the KoLeo loss is calculated. Does that mean a separate KoLeo loss is calculated for each GPU, and then the scalar loss is averaged across GPUs?


JunzheJosephZhu commented 8 months ago

@TimDarcet

amundra15 commented 2 months ago

Hi @surajyakoa. Did you make any progress with your fine-tuning? I am in a similar spot, where the model performance deteriorates gradually as fine-tuning progresses.

amundra15 commented 1 month ago

keeping the whole backbone frozen or unfreezing a few final layers

@surajyakoa can I ask how you unfroze a few layers selectively? I am trying to implement this, but am running into errors. I have two approaches in mind:

  1. Set the requires_grad flag to False for the initial layers. However, upon passing this model to FSDP, it complains that requires_grad needs to be uniform across all layers.
  2. Set the LR for the initial layers to 0. However, I am not able to figure out how the various layers are mapped to param groups by get_params_groups().

UPDATE: I managed to do so by adding the following lines to get_params_groups_with_decay() in param_groups.py:

       if "blocks." in name and "residual" not in name: \
            block_id = int(name[name.find("blocks.") :].split(".")[1])
            if block_id < freeze_vit_nlayers:
                d.update({"lr_multiplier": 0.0})
ganeshv-cerebras commented 1 week ago

The only batch-size-dependent parts are SK and KoLeo. Both add a contrastive element to the loss, similar to SimCLR, so they might behave badly at small BS.

Looking at even the regular centering in DINOLoss, it also reduces the batch centers across GPUs. Does this mean that even centering is sensitive to batch size, and that is why it is computed globally across all GPUs? Link: https://github.com/facebookresearch/dinov2/blob/e1277af2ba9496fbadf7aec6eba56e8d882d1e35/dinov2/loss/dino_clstoken_loss.py#L86
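
For reference, this is roughly the pattern I mean (an illustrative sketch of a DINO-style center update, not the repo's exact code):

    import torch
    import torch.distributed as dist

    @torch.no_grad()
    def update_center(center, teacher_output, momentum=0.9):
        # The per-GPU sum of teacher outputs is all-reduced, so the EMA center tracks the
        # mean over the *global* batch rather than the local per-GPU mini-batch.
        batch_sum = teacher_output.sum(dim=0, keepdim=True)
        count = torch.tensor([teacher_output.shape[0]], dtype=torch.float, device=teacher_output.device)
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(batch_sum)
            dist.all_reduce(count)
        return center * momentum + (batch_sum / count) * (1 - momentum)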