What accuracy do you get on the training set?
@qasfb almost the same as validation accuracy. Overfitting is not an issue.
StepLR(optimizer, step_size=step_size, gamma=0.0001) This multiplies your learning rate by 0.0001 every step_size=5 epochs, is my understanding correct?
Yes, so the learning rate decreases by a factor of 0.0001.
I think this is why it doesn't work: after 5 epochs the learning rate essentially becomes 0. Can you try without that scheduling?
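To make that concrete, here is a minimal sketch of what that schedule does to the learning rate, assuming scheduler.step() is called once per epoch and an initial lr of 1e-3:

```python
import torch
from torch.optim.lr_scheduler import StepLR

# Dummy optimizer just to inspect the schedule (illustrative values).
opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=1e-3)
sched = StepLR(opt, step_size=5, gamma=0.0001)

for epoch in range(11):
    # 1e-3 for epochs 0-4, 1e-7 from epoch 5, 1e-11 at epoch 10
    print(epoch, opt.param_groups[0]["lr"])
    sched.step()
```

After two steps of the scheduler the learning rate is 1e-11, which is effectively zero.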
@qasfb I tried your suggestion on CIFAR-10 and the following are the results: With Scheduler:
Without Scheduler:
I trained without the scheduler for 20 more epochs, and the accuracy does seem to still be increasing. Still, there is no significant difference in overall accuracy with or without the scheduler: it stays in the 20s in both cases, though with the scheduler it converges faster.
Here is a link to the Colab notebook for these experiments if you want to take a detailed look: https://drive.google.com/file/d/1LmFgW-A5VzUeI6haFz7JkwAGCoKiDYxW/view?usp=sharing
In case it's helpful (as I came across your issue whilst trying to debug something myself), I was getting similarly poor performance fine-tuning DINOv2 with the HuggingFace trainer defaults and found it was very sensitive to the learning rate. Reducing the learning rate to 5e-6 (from the default of 5e-5) achieved much better results (slightly better than just training a linear classification head on top of a frozen base model). This was with a linear scheduler on the learning rate in both cases (so starting at the initial values quoted above then reducing during training), which is also the HuggingFace default.
The learning rate you have above is much higher (1e-3), so maybe try something a lot smaller and see what happens?
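For reference, a minimal sketch of that setup with the HuggingFace Trainer (model, train_ds, and val_ds are assumed to exist; the epoch count and batch size are placeholders, and the linear schedule is the Trainer default):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="dinov2-finetune",
    learning_rate=5e-6,          # lowered from the 5e-5 default
    lr_scheduler_type="linear",  # HF default: linear decay from the initial LR
    num_train_epochs=10,
    per_device_train_batch_size=32,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```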
@jack89roberts I will try your suggestions and post my results here soon. Thank you so much.
If fine-tuning is not possible (or not the objective of the authors), then there needs to be some other way to increase DINOv2's performance on medical imaging data.
@namrahrehman
Any update on this?
Hi @jack89roberts, which DINOv2 model did you use for your training on HF? The facebook/dinov2 models, the models fine-tuned on ImageNet, or the timm/dinov2 models? Do you know the difference between the facebook and the timm models? Thank you in advance and have a good day!
I've used only the facebook/dinov2 ones for HuggingFace transformers (specifically facebook/dinov2-small-imagenet1k-1-layer and facebook/dinov2-base-imagenet1k-1-layer). I've not used the timm ones (or the ones downloadable from the repo/torch hub).
Thank you very much for this information. So, if I've understood correctly, you trained the whole model (unfrozen, backbone + head) starting with lr = 5e-6 and linearly decreasing the value with the scheduler? Have a good day!
Yes that's right, just the HF trainer defaults with the lower learning rate basically.
@jack89roberts Hi, can you specify the GPU memory required for this process?
I will be training a linear head (with a frozen DINOv2 backbone) on a few custom medical images for segmentation. I have only 8GB of GPU memory available. Would it be enough, given that the backbone will be kept frozen?
Thanks in advance!
You may be better off asking that elsewhere, but from a quick look at the training jobs I have run with DINOv2 small/base, I think that should be OK, yes. I've not used the large/giant variants.
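One quick sanity check is to count the trainable parameters with the backbone frozen, since only those contribute gradients and optimizer state. A sketch, assuming the torch hub dinov2_vits14 backbone and an illustrative linear head:

```python
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for p in backbone.parameters():
    p.requires_grad = False  # frozen: no gradients or optimizer state kept

head = torch.nn.Linear(backbone.embed_dim, 2)  # head size is illustrative

frozen = sum(p.numel() for p in backbone.parameters())
trainable = sum(p.numel() for p in head.parameters())
print(f"frozen: {frozen:,}  trainable: {trainable:,}")
```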
@namrahrehman can you please share the linear evaluation code? I'd appreciate it!
I came across your issue: the model performs well with a linear probe, but poorly in full-model fine-tuning.
I think you may want to try setting different learning rates for the backbone and the classifier. I used 1e-6 for the backbone and 1e-4 for the classifier when doing end-to-end fine-tuning, following the strategy in the SWAT paper, and got better fine-tuning performance than linear probing on the Semi-Aves dataset (5% higher!); see the sketch below.
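A minimal sketch of that two-learning-rate setup, assuming a model with backbone and classifier submodules (the attribute names and the weight decay value are illustrative):

```python
import torch.optim as optim

# Separate parameter groups: a tiny LR for the pretrained backbone,
# a larger one for the freshly initialized classifier head.
optimizer = optim.AdamW(
    [
        {"params": model.backbone.parameters(), "lr": 1e-6},
        {"params": model.classifier.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-4,  # assumed value, tune for your dataset
)
```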
Excuse me, how did you run the training code? On the original version of DINOv2, I ran the dinov2/run/train/train.py file with the following command and found that it never started training or produced any error messages: python dinov2/run/train/train.py --nodes 1 --config-file dinov2/configs/train/vitl16_short.yaml --output-dir <PATH/TO/OUTPUT/DIR> train.dataset_path=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
I tried recreating this code and changed a few things, achieving Accuracy: 1.0000, val_accuracy: 0.9925, loss: 0.0007, val_loss: 0.0389 in 20 epochs (on a different dataset; with the original setup I never got past a training accuracy of 0.61):
weight_decay = 1e-4  # just experimented with this
lr = 1e-5  # lower learning rate, since this is fine-tuning
step_size = 5
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
# Create a learning rate scheduler
scheduler = StepLR(optimizer, step_size=step_size, gamma=0.1)  # raised gamma: 0.0001 was way too low and stopped learning completely after 10 epochs (the LR is multiplied by gamma at epochs 5 and 10)
May I ask: during fine-tuning, did you load the published dinov2_vitl14_pretrain.pth weights only into the teacher network, or did you load them into both the teacher and student networks? I'll try to adjust the fine-tuning experiment as you suggested. Thank you very much!
We did some analysis and found that DINOv2 behaves more like a binary-mask semantic representation model than the semantic analyzer that most pretrained models act as. The distilled pretraining method pays for being label-free with strongly biased modeling on common vision tasks. It performs well in the zero-shot setting only because it has seen far more data than the ImageNet-1k models. BUT THAT IS AS FAR AS IT GOES. For objects unseen during pretraining in new downstream tasks, it is worse than any ImageNet-1k pretrained ViT. Don't count on DINO if you have your own labeled images; go to a supervised pretrained ViT without hesitation. The DINO feature map is not a good feature for fine-grained tasks such as fine-grained classification. BTW, all ViTs are sensitive to hyperparameters: tune the learning rate, weight decay, batch size, and dropout rate one by one; it's almost a grid search.
The best practice for DINO would be: if you have a large amount of unlabeled images for your domain task and find the original weights can't do decent zero-shot classification, do a first-stage fine-tuning in DINO's original unsupervised way to get better features, then a second-stage fine-tuning with a few labeled images.
This holds if your goal is classification: the distillation pretraining will not help no matter how much computation you devote, because its direction is biased away from the target task.
If you have a well-labeled dataset, just go with any ImageNet-1k model. Swin or Vision-RWKV is a good choice. If you insist on using a ViT, CLIP is better than DINO anyway.
Hello, I completely agree with what you said, but I'm not quite sure what you're referring to: using supervised training with distillation in other ViT models to achieve better results than DINOv2 (since ViT training requires a large amount of memory), or using supervised training to achieve better feature extraction with models like Swin Transformer? Thank you very much!
I am trying to fine-tune DINOv2 for image classification on a custom dataset (a medical image dataset) with the objective of increasing accuracy. The problem is that with linear evaluation I get an adequate accuracy of almost 75%, but when I try to fine-tune the whole backbone I can never get an accuracy higher than 40%. Is there something semantically wrong with how I am trying to fine-tune this model? I even tried it with CIFAR-10 and got excellent performance on linear evaluation but poor performance on fine-tuning. Also, when I used the model from the hub and ran the following code snippet, I got "Pre-trained DINO weights are not found in the model's state_dict.", so instead I had to load the model from Hugging Face for fine-tuning the whole backbone:
the following is my code for fine-tuning:
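The exact snippet isn't reproduced above, but a minimal sketch of this kind of full-backbone fine-tuning setup, assuming the HuggingFace AutoModelForImageClassification API, a standard train_loader, and the hyperparameters suggested in this thread (illustrative, not the exact code from the issue), would look roughly like:

```python
import torch
from torch import optim
from torch.optim.lr_scheduler import StepLR
from transformers import AutoModelForImageClassification

# Load a DINOv2 checkpoint from the Hub with a fresh classification head.
model = AutoModelForImageClassification.from_pretrained(
    "facebook/dinov2-base", num_labels=10  # e.g. CIFAR-10
)
for p in model.parameters():
    p.requires_grad = True  # fine-tune the whole backbone, not just the head

optimizer = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)

model.train()
for epoch in range(20):
    for pixel_values, labels in train_loader:  # assumes a standard DataLoader
        out = model(pixel_values=pixel_values, labels=labels)  # HF returns the loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()  # stepped once per epoch
```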