Open surajyakoa opened 1 year ago
Do any of the authors have an intuition of what a reasonable batch size needs to be for the centering or SK / KoLeo techniques to work correctly?
What kind of fine-tuning are you trying to do? Are you trying to perform DINOv2 training with your own unlabeled data, but using the distilled ViT-S/B as initializations? Because I expect it wouldn't be useful, as these networks have not done masked-image modeling during their distillation phase.
About the batch size I would say your guess is as good as mine; we haven't explored batch sizes that small in our work.
Hello,
Thanks for your response. Yes, I'm hoping to perform dinov2-training with my own unlabelled data but using vit_s/vit_b as initialization. That is curious. Is there a reason why the Masked Image performance would not transfer over from the distillation objective? What if I only finetuned using the dino objective and left out the MIM?
Gotcha, have you noticed though that there is a proclivity to your embeddings collapsing if you use too small of a batch size? Or is this something you have not run into?
Thanks!
May I ask if you've made some progress with your fine-tuning?
What kind of fine-tuning are you trying to do? Are you trying to perform DINOv2 training with your own unlabeled data, but using the distilled ViT-S/B as initializations? Because I expect it wouldn't be useful, as these networks have not done masked-image modeling during their distillation phase.
About the batch size I would say your guess is as good as mine; we haven't explored batch sizes that small in our work.
In the paper it says that iBoT loss is used for distillation. If I understand correctly, the distillation's iBoT is done with predicting a distribution with the teacher given the unmasked input, predicting a distribution with the student given the unmasked input, and minimizing the cross entropy between the two. Could you confirm this? @qasfb
What if I only finetuned using the dino objective and left out the MIM?
Same question here @TimDarcet @qasfb @patricklabatut
In the paper it says that iBoT loss is used for distillation. If I understand correctly, the distillation's iBoT is done with predicting a distribution with the teacher given the unmasked input, predicting a distribution with the student given the unmasked input, and minimizing the cross entropy between the two. Could you confirm this? @qasfb
Correct. During distillation, the iBOT loss is applied as usual, except there is no masking. So the loss is applied on the patch tokens.
What if I only finetuned using the dino objective and left out the MIM?
No idea. Our intuition is that usually MIM improves all scores, but we can't be sure for this specific setup.
To go back to the original question:
Any recommendation on what minimum batch size I would need for this to work well?
We haven't experimented with total_batch_size < 2048, so we don't know
And if it is large, is there any potential workarounds that could be possible to train on a single GPU?
The only batch size-dependent parts are SK and KoLeo. Both add a contrastive element to the loss, similar to a simclr, so they might behave badly at small BS.
- SK: you can try replacing SK with the original DINO-style centering, which has no intra-batch dependency (train.centering=centering)
- KoLeo: you can try putting a lower koleo_loss_weight, to account for the higher loss values at low BS, or simply removing it. The training should work well.
Note that lower BS also means different optimal hparams. Notably:
- the lr needs to be lower (a sqrt scaling will be done automatically by the codebase, but you might need to tweak around this value)
- the momentums need to be higher (a good rule of thumb to start with might be something around $mom' = mom^{\frac{bs'}{bs}}$; I give no guarantee for this rule). This means of course teacher momentum, but also possibly the centering momentum (https://github.com/facebookresearch/dinov2/blob/2302b6bf46953431b969155307b9bed152754069/dinov2/loss/dino_clstoken_loss.py#L17C15-L17C15), and even maybe the AdamW beta1 and beta2.
- maybe other hyperparameters too, who knows
Remember you're in uncharted territory, you might need to tweak a lot before it works, and explore very wide ranges of hyperparameters. I'm curious though! If you get results that are okay to share, I'd love to know if and how those batch sizes can work.
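The momentum rescaling rule of thumb mentioned in this thread, $mom' = mom^{\frac{bs'}{bs}}$, is easy to apply numerically. A tiny sketch (the batch sizes and momentum value below are illustrative, not the repo's defaults, and the rule itself comes with no guarantees):

```python
def scale_momentum(mom: float, base_bs: int, new_bs: int) -> float:
    """Rule-of-thumb momentum rescaling: mom' = mom ** (new_bs / base_bs).

    Smaller batches give an exponent < 1, pushing the momentum closer
    to 1, i.e. the EMA (teacher or center) updates more slowly per step.
    """
    return mom ** (new_bs / base_bs)

# e.g. going from batch 1024 to batch 16 with teacher momentum 0.994
# yields a momentum very close to 1 (about 0.9999)
```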
Thanks for the response, Timothee! Regarding the Koleo loss, in the repo I can't find a sync/reduce op before the Koleo loss is calculated. Does that mean a separate KoLeo loss is calculated for each GPU, and then the scalar loss is averaged across GPUs?
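For context on what would be computed on each GPU: KoLeo penalizes embeddings that sit too close to their nearest neighbor within a batch. A minimal single-process sketch (a simplification for illustration, not the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def koleo_loss(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Minimal KoLeo sketch: push each embedding away from its nearest
    neighbor within the local batch by maximizing the log of the
    smallest pairwise distance (no cross-GPU gather in this sketch).
    """
    x = F.normalize(features, dim=-1)       # unit-norm embeddings
    dists = torch.cdist(x, x)               # (B, B) pairwise L2 distances
    dists.fill_diagonal_(float("inf"))      # ignore self-distance
    nn_dist = dists.min(dim=1).values       # distance to nearest neighbor
    return -torch.log(nn_dist + eps).mean()
```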
@TimDarcet
Hi @surajyakoa. Did you make any progress with your fine-tuning? I am in a similar spot where the model performance deteriorates gradually with the fine-tuning.
keeping the whole backbone frozen or unfreezing a few final layers
@surajyakoa can I ask how you unfroze a few layers selectively? I am trying to implement this, but am running into errors. I have two approaches in mind:
UPDATE:
managed to do so by adding the following lines to get_params_groups_with_decay()
in param_groups.py
```python
if "blocks." in name and "residual" not in name:
    block_id = int(name[name.find("blocks.") :].split(".")[1])
    if block_id < freeze_vit_nlayers:
        d.update({"lr_multiplier": 0.0})
```
The only batch size-dependent parts are SK and KoLeo. Both add a contrastive element to the loss, similar to a simclr, so they might behave badly at small BS.
If I look at even the regular centering in the DinoLoss, it also reduces the batch centers across the GPUs. So does this mean even centering is sensitive to batch size, and that's why it is computed globally across all GPUs? Link: https://github.com/facebookresearch/dinov2/blob/e1277af2ba9496fbadf7aec6eba56e8d882d1e35/dinov2/loss/dino_clstoken_loss.py#L86
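For readers puzzling over this: DINO-style centering keeps an EMA of the mean teacher output, and in the linked multi-GPU code the per-batch mean is all-reduced so the EMA tracks the global batch mean. A minimal single-process sketch (illustrative only; class and method names are made up):

```python
import torch

class Centering:
    """Minimal sketch of DINO-style centering: subtract an EMA of the
    mean teacher output before the teacher softmax. In the multi-GPU
    code, the per-batch mean would additionally be all-reduced across
    ranks so the EMA tracks the global, not per-GPU, batch mean.
    """
    def __init__(self, dim: int, momentum: float = 0.9):
        self.center = torch.zeros(dim)
        self.momentum = momentum

    def apply(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        return teacher_logits - self.center

    @torch.no_grad()
    def update(self, teacher_logits: torch.Tensor) -> None:
        batch_center = teacher_logits.mean(dim=0)  # all_reduce here when distributed
        self.center = self.center * self.momentum + batch_center * (1 - self.momentum)
```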
Hello!
I am trying to finetune either the vit_s or vit_b models to my dataset. I have tried training only the dino head, both the dino and ibot heads, and keeping the whole backbone frozen or unfreezing a few final layers. I have a large dataset (> 1000000 images), but I'm trying to train on a T4, meaning my batch size is either 4, 8, or 16. I always seem to see the Nearest Neighbors performance on my dataset slowly decline as I train and further analysis makes me believe all the datapoints are collapsing.
I'm wondering if this is due to my small batch size. I don't have a strong intuition yet of the centering, so I'm wondering if my small batch size basically assures that I will have my datapoints collapse (I am used to using a contrastive loss with a queue of negative examples, making this not an issue.) Any recommendation on what minimum batch size I would need for this to work well? And if it is large, is there any potential workarounds that could be possible to train on a single GPU?