MohammedSB opened this issue 12 months ago
I'm also looking for this, but I guess I won't find the exact learning rate that was used.
In the evaluation code you can find all the learning rates they considered in order to choose the best one:
```python
learning_rates=[1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 0.1],
```
In the paper, you can find this:
It would be awesome if the results of the grid search were published somewhere. I'm still looking for them.
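For reference, here is a minimal sketch of how one could sweep that grid with a linear probe on frozen features and keep the best validation accuracy. The synthetic features, probe dimensions, and tiny training loop below are all placeholders, not the actual DINOv2 evaluation code, which is more involved:

```python
import torch
import torch.nn as nn

# The LR grid from the evaluation code, swept with a linear probe.
learning_rates = [1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4,
                  1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 0.1]

# Stand-in features/labels; in practice these come from the frozen backbone.
feat_dim, num_classes = 768, 10  # 768 = ViT-B/14 embedding size; 10 is arbitrary
train_x, train_y = torch.randn(512, feat_dim), torch.randint(0, num_classes, (512,))
val_x, val_y = torch.randn(128, feat_dim), torch.randint(0, num_classes, (128,))

best_lr, best_acc = None, 0.0
for lr in learning_rates:
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    for _ in range(10):  # a few passes over the cached features
        opt.zero_grad()
        nn.functional.cross_entropy(probe(train_x), train_y).backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(val_x).argmax(dim=1) == val_y).float().mean().item()
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print(f"best lr: {best_lr} (val acc {best_acc:.3f})")
```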
These learning rate values are for linear probing, not fine-tuning. For fine-tuning, what worked for me was a grid search over the values {1e-5, 1e-6, 1e-7}, and lower learning rates worked better for larger models.
But I used a "standard" fine-tuning recipe as opposed to the one from BEiT/DeiT that DINOv2 uses, so you should probably use that one if you are just starting.
@MohammedSB can you explain what you mean by "standard" fine-tuning?
I just mean that I did not use layer-wise LR scaling, weight decay, LR warmup, mixup, or any of the other techniques in the DeiT III fine-tuning recipe, which DINOv2 uses: https://github.com/facebookresearch/deit/blob/main/main.py
I basically just used an LR decay scheduler and that is it.
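Roughly, the setup looked like the sketch below. The backbone variant, LR value, epoch count, and the synthetic dataloader are placeholders, not my exact configuration:

```python
import torch

# Minimal "standard" fine-tuning: the whole DINOv2 backbone plus a linear
# head, one small learning rate, and a plain LR decay schedule -- no
# layer-wise LR scaling, warmup, weight decay, or mixup.
num_classes, num_epochs = 10, 20  # placeholders for your task
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model = torch.nn.Sequential(backbone, torch.nn.Linear(backbone.embed_dim, num_classes))

# Stand-in for a real dataloader; DINOv2 expects image sides divisible by 14.
train_loader = [(torch.randn(2, 3, 224, 224), torch.randint(0, num_classes, (2,)))]

# Scan e.g. {5e-5, 1e-5, 5e-6, 1e-6, 1e-7}; lower LRs tend to win for larger models.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the LR once per epoch
```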
Understood. What was the LR that worked best for you? Also, any chance you tried freezing the initial layers and only fine-tuning the later ones?
You can scan the values {5e-5, 1e-5, 5e-6, 1e-6, 1e-7}. Like I said, lower learning rates worked better for me with larger models. The optimal LR changed from dataset to dataset.
I didn't try freezing the initial layers, though I suspect it won't bring much benefit. Honestly, if you want to fine-tune the model, just use the DeiT codebase I shared; it is a really strong ViT fine-tuning recipe.
Hi all, I ran standard (end-to-end) fine-tuning, i.e. adding a linear classifier on top of the DINOv2 backbone, and achieved a 5% accuracy improvement when fine-tuning on the 16-shot split of the Semi-Aves dataset. I used the training strategy from the SWAT paper, which uses a larger learning rate of 1e-4 for the classifier and a smaller learning rate of 1e-6 for the backbone. This setting helps preserve the pretrained features and stabilizes fine-tuning.
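In code, that two-LR setup looks roughly like this (the backbone variant here is an assumption; 200 is the Semi-Aves labeled class count):

```python
import torch

# Two parameter groups with different learning rates, following SWAT: a
# larger LR for the new linear classifier, a much smaller one for the
# pretrained backbone so its features are preserved.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
classifier = torch.nn.Linear(backbone.embed_dim, 200)  # Semi-Aves: 200 labeled classes

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-6},    # small LR preserves pretrained features
    {"params": classifier.parameters(), "lr": 1e-4},  # larger LR lets the new head adapt
])
```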
With regards to this sentence:
In Table 5, we show that the Top-1 accuracy on the validation set of ImageNet-1k improves by more than +2% when the backbone is fine-tuned.
What was the learning rate used for fine-tuning the backbone?
May I ask how to obtain the /labels.txt file mentioned by the author? Thank you very much!