facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0
9.28k stars 829 forks

Is this the right way to fine-tune DINOv2? #276

Open namrahrehman opened 1 year ago

namrahrehman commented 1 year ago

I am trying to fine-tune DINOv2 for image classification on a custom dataset (a medical image dataset) with the objective of increasing accuracy. The problem is that with linear evaluation I get an adequate accuracy of almost 75%, but when I try to fine-tune the whole backbone I can never get an accuracy higher than 40%. Is there something semantically wrong with how I am trying to fine-tune this model? I even tried it with CIFAR-10 and got excellent performance with linear evaluation but poor performance with fine-tuning.

Also, when I used the model from the hub and ran the following code snippet, I got "Pre-trained DINO weights are not found in the model's state_dict.", so I instead had to load the model from Hugging Face for fine-tuning the whole backbone:

# Check whether any parameter name in the state_dict contains the substring 'dino'
pretrained_dino_keys = [k for k in model.state_dict() if 'dino' in k]

if pretrained_dino_keys:
    print("Pre-trained DINO weights are present in the model's state_dict.")
else:
    print("Pre-trained DINO weights are not found in the model's state_dict.")

The following is my code for fine-tuning:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder
from transformers import Dinov2ForImageClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Dinov2ForImageClassification.from_pretrained("facebook/dinov2-small-imagenet1k-1-layer")
for param in model.dinov2.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

# Customize the head for the classification task
num_classes = 10  # Number of classes in the dataset
model.classifier = nn.Linear(768, num_classes).to(device)  # linear classification head, moved to the GPU

# Define the loss function 
loss_fn = nn.CrossEntropyLoss()  

weight_decay = 1e-3 
lr = 0.001
step_size = 5
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

# Create a learning rate scheduler
scheduler = StepLR(optimizer, step_size=step_size, gamma=0.0001)
def make_classification_eval_transform(
    *,
    resize_size: int = 256,
    interpolation=transforms.InterpolationMode.BICUBIC,
    crop_size: int = 224,
) -> transforms.Compose:
    transforms_list = [
        transforms.Resize(resize_size, interpolation=interpolation),
        transforms.CenterCrop(crop_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ]
    return transforms.Compose(transforms_list)

# Use the make_classification_eval_transform function to create the transformation pipeline
transform = make_classification_eval_transform()

# Set up data loaders for training, validation, and test
train_dataset = ImageFolder(root=train_dataset_path, transform=transform)
valid_dataset = ImageFolder(root=valid_dataset_path, transform=transform)
test_dataset = ImageFolder(root=test_dataset_path, transform=transform)

# Modify data loading to move data to the same device as the model
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2)
valid_loader = DataLoader(valid_dataset, batch_size=32, shuffle=False, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=2)
model = model.to(device)
# Set random seed
torch.manual_seed(1)

# Define the number of epochs
num_epochs = 20

# Initialize lists to store loss and accuracy for each epoch
loss_hist_train = [0.0] * num_epochs
accuracy_hist_train = [0.0] * num_epochs
loss_hist_valid = [0.0] * num_epochs
accuracy_hist_valid = [0.0] * num_epochs

for epoch in range(num_epochs):
    model.train()
    loss_accumulated_train = 0.0  # Initialize to zero
    total_samples_train = 0  # Initialize to zero
    correct_predictions_train = 0  # Initialize to zero

    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        output = model(x_batch)
        logits = output.logits
        loss = loss_fn(logits, y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_accumulated_train += loss.item() * y_batch.size(0)  # Accumulate as a scalar
        total_samples_train += y_batch.size(0)

        # Calculate accuracy
        predicted = torch.max(logits, 1)[1]
        correct_predictions_train += torch.sum(predicted == y_batch).item()  # Accumulate as a scalar

    loss_hist_train[epoch] = loss_accumulated_train / total_samples_train  # Average loss per sample
    accuracy_hist_train[epoch] = correct_predictions_train / total_samples_train  # Calculate accuracy directly

    scheduler.step()

    model.eval()
    with torch.no_grad():
        loss_accumulated_valid = 0.0  # Initialize to zero
        total_samples_valid = 0  # Initialize to zero
        correct_predictions_valid = 0  # Initialize to zero

        for x_batch, y_batch in valid_loader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)
            output = model(x_batch)
            logits = output.logits
            loss = loss_fn(logits, y_batch)
            loss_accumulated_valid += loss.item() * y_batch.size(0)  # Accumulate as a scalar
            total_samples_valid += y_batch.size(0)

            # Calculate accuracy
            predicted = torch.max(logits, 1)[1]
            correct_predictions_valid += torch.sum(predicted == y_batch).item()  # Accumulate as a scalar

        loss_hist_valid[epoch] = loss_accumulated_valid / total_samples_valid  # Average loss per sample
        accuracy_hist_valid[epoch] = correct_predictions_valid / total_samples_valid  # Calculate accuracy directly

    print(f'Epoch {epoch + 1} accuracy: {accuracy_hist_train[epoch]:.4f} val_accuracy: {accuracy_hist_valid[epoch]:.4f} loss: {loss_hist_train[epoch]:.4f} val_loss: {loss_hist_valid[epoch]:.4f}')

qasfb commented 1 year ago

What accuracy do you get on the training set?

namrahrehman commented 1 year ago

@qasfb Almost the same as the validation accuracy. Overfitting is not an issue.

qasfb commented 1 year ago

StepLR(optimizer, step_size=step_size, gamma=0.0001): this multiplies your learning rate by 0.0001 every step_size=5 epochs, is my understanding correct?

namrahrehman commented 1 year ago

Yes, so the learning rate is multiplied by 0.0001.

qasfb commented 1 year ago

I think this is why it doesn't work: after 5 epochs the learning rate essentially becomes 0. Can you try without that scheduling?
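
For illustration, a minimal sketch of what that schedule does to the learning rate, assuming the scheduler is stepped once per epoch as in the training loop above:

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# Dummy parameter just to build an optimizer; lr/step_size/gamma mirror the snippet above.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([param], lr=1e-3)
scheduler = StepLR(optimizer, step_size=5, gamma=0.0001)

for epoch in range(1, 15):
    optimizer.step()   # no gradients here; just keeps the optimizer/scheduler step order valid
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
# printed lr: epochs 1-4 -> 1e-3, epochs 5-9 -> 1e-7 (effectively zero), epochs 10-14 -> 1e-11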

namrahrehman commented 1 year ago

@qasfb I tried your suggestion on CIFAR-10 and the following are the results: With scheduler: [image]

Without scheduler: [image]

I trained without the scheduler for 20 more epochs, and the accuracy does seem to be increasing. Still, there is no significant difference in overall accuracy with or without the scheduler; it is in the 20s in both cases. With the scheduler, it converges faster.

Here is a link to the Colab notebook for these experiments if you want to take a detailed look: https://drive.google.com/file/d/1LmFgW-A5VzUeI6haFz7JkwAGCoKiDYxW/view?usp=sharing

jack89roberts commented 11 months ago

In case it's helpful (as I came across your issue whilst trying to debug something myself), I was getting similarly poor performance fine-tuning DINOv2 with the HuggingFace trainer defaults and found it was very sensitive to the learning rate. Reducing the learning rate to 5e-6 (from the default of 5e-5) achieved much better results (slightly better than just training a linear classification head on top of a frozen base model). This was with a linear scheduler on the learning rate in both cases (so starting at the initial values quoted above then reducing during training), which is also the HuggingFace default.

The learning rate you have above is much higher (1e-3), so maybe try something a lot smaller and see what happens?
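
For reference, a minimal sketch of that setup with the HuggingFace Trainer (a sketch under assumptions: the checkpoint and class count follow the code earlier in this thread, and the dataset objects are assumed to be HF datasets with pixel_values/labels columns):

from transformers import Dinov2ForImageClassification, Trainer, TrainingArguments

model = Dinov2ForImageClassification.from_pretrained(
    "facebook/dinov2-small-imagenet1k-1-layer",
    num_labels=10,                 # number of classes in the target dataset
    ignore_mismatched_sizes=True,  # re-initialise the classification head
)

args = TrainingArguments(
    output_dir="dinov2-finetune",
    learning_rate=5e-6,            # lowered from the 5e-5 Trainer default
    lr_scheduler_type="linear",    # the Trainer default: linear decay
    num_train_epochs=20,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # assumed: preprocessed HF datasets
    eval_dataset=valid_dataset,
)
trainer.train()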

namrahrehman commented 11 months ago

@jack89roberts I will try your suggestions and post my results here soon. Thank you so much.

If fine-tuning is not possible (or not the objective of the authors), then there needs to be some other way to increase DINOv2's performance on medical imaging data.

twmht commented 11 months ago

@namrahrehman

Any update on this?

lombardata commented 10 months ago

In case it's helpful (as I came across your issue whilst trying to debug something myself), I was getting similarly poor performance fine-tuning DINOv2 with the HuggingFace trainer defaults and found it was very sensitive to the learning rate. Reducing the learning rate to 5e-6 (from the default of 5e-5) achieved much better results (slightly better than just training a linear classification head on top of a frozen base model). This was with a linear scheduler on the learning rate in both cases (so starting at the initial values quoted above then reducing during training), which is also the HuggingFace default.

The learning rate you have above is much higher (1e-3), so maybe try something a lot smaller and see what happens?

Hi @jack89roberts, which dinov2 model did you use for your training on HF? The facebook/dinov2 models, the models fine-tuned on ImageNet, or the timm/dinov2 models? Do you know the difference between the facebook and the timm models? Thank you in advance and have a good day!

jack89roberts commented 10 months ago

I've used only the facebook/dinov2 ones for HuggingFace transformers (specifically facebook/dinov2-small-imagenet1k-1-layer and facebook/dinov2-base-imagenet1k-1-layer). I've not used the timm ones (or the ones downloadable from the repo/torch hub).

lombardata commented 10 months ago

I've used only the facebook/dinov2 ones for HuggingFace transformers (specifically facebook/dinov2-small-imagenet1k-1-layer and facebook/dinov2-base-imagenet1k-1-layer). I've not used the timm ones (or the ones downloadable from the repo/torch hub).

Thank you very much for this information. So, if I've understood correctly, you trained the whole model (unfrozen, backbone + head), starting with lr = 5e-6 and linearly decreasing it with the scheduler? Have a good day!

jack89roberts commented 10 months ago

Yes that's right, just the HF trainer defaults with the lower learning rate basically.

Raspberry-beans commented 10 months ago

@jack89roberts Hi, can you specify the GPU memory required for this process?

I will be training a linear head (with a frozen DINOv2 backbone) on a few custom medical images for segmentation. I have only 8 GB of GPU memory available. Would that be enough, given that the backbone will be kept frozen?

Thanks in advance!

jack89roberts commented 10 months ago

You may be better off asking that elsewhere, but from a quick look at the training jobs I have run with DINOv2 small/base, I think that should be OK, yes. I've not used the large/giant variants.

anonymouslei commented 7 months ago

@namrahrehman Can you please share the linear evaluation code? I'd appreciate it!

TIanCat commented 3 months ago

I came across your issue. It performs well with a linear probe, but poorly with full-model fine-tuning.

tian1327 commented 3 months ago

I think you may want to try setting different learning rates for the backbone and the classifier. I used 1e-6 for the backbone and 1e-4 for the classifier when doing end-to-end fine-tuning, following the strategy in the SWAT paper. I got better fine-tuning performance than linear probing on the Semi-Aves dataset (5% higher!).
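
For example, a minimal sketch of that split using the Dinov2ForImageClassification model from earlier in this thread (the 1e-6 / 1e-4 values follow the comment above; the weight decay is just a placeholder):

import torch.optim as optim

# Two parameter groups: a very small learning rate for the pretrained backbone,
# a larger one for the freshly initialised classification head.
optimizer = optim.AdamW(
    [
        {"params": model.dinov2.parameters(), "lr": 1e-6},
        {"params": model.classifier.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-3,
)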

PMRS-lab commented 2 months ago

@qasfb I tried your suggestion on CIFAR-10 and the following are the results: With scheduler: [image]

Without scheduler: [image]

I trained without the scheduler for 20 more epochs, and the accuracy does seem to be increasing. Still, there is no significant difference in overall accuracy with or without the scheduler; it is in the 20s in both cases. With the scheduler, it converges faster.

Here is a link to the Colab notebook for these experiments if you want to take a detailed look: https://drive.google.com/file/d/1LmFgW-A5VzUeI6haFz7JkwAGCoKiDYxW/view?usp=sharing

Excuse me, how did you run the training code? On the original version of DINOv2, I ran the dinov2/run/train/train.py file with the following command and found that it never started training or produced any error messages:

python dinov2/run/train/train.py --nodes 1 --config-file dinov2/configs/train/vitl16_short.yaml --output-dir <PATH/TO/OUTPUT/DIR> train.dataset_path=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

MirekJara commented 2 months ago

I tried recreating this code and changed a few things, achieving accuracy: 1.0000, val_accuracy: 0.9925, loss: 0.0007, val_loss: 0.0389 after 20 epochs (on a different dataset; with the original setup I never passed a training accuracy of 0.61).

weight_decay = 1e-4  # just experimented with this
lr = 1e-5            # lower learning rate, since this is fine-tuning
step_size = 5
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

# Create a learning rate scheduler
scheduler = StepLR(optimizer, step_size=step_size, gamma=0.1)  # changed gamma: 0.0001 was far too low and stopped learning completely after 10 epochs (the LR is multiplied by gamma at epochs 5 and 10)

PMRS-lab commented 2 months ago

I tried recreating this code and changed a few things, achieving accuracy: 1.0000, val_accuracy: 0.9925, loss: 0.0007, val_loss: 0.0389 after 20 epochs (on a different dataset; with the original setup I never passed a training accuracy of 0.61)

weight_decay = 1e-4  # just experimented with this
lr = 1e-5            # lower learning rate, since this is fine-tuning
step_size = 5
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

# Create a learning rate scheduler
scheduler = StepLR(optimizer, step_size=step_size, gamma=0.1)  # changed gamma: 0.0001 was far too low and stopped learning completely after 10 epochs (the LR is multiplied by gamma at epochs 5 and 10)

May I ask: during the fine-tuning process, did you load the published dinov2_vitl14_pretrain.pth weights only into the teacher network, or did you load them into both the teacher and student networks? I'll try adjusting the fine-tuning experiment as you suggested. Thank you very much!

sipie800 commented 2 months ago

We did some analysis and found that DINOv2 behaves more like a binary-mask semantic representation model than the semantic analyzer that most pretrained models act as. The distilled pretraining method pays for being label-free with strongly biased modeling on common vision tasks. It performs well in the zero-shot setting only because it has seen far more data than the ImageNet-1k models, but that is as far as it goes. For objects unseen during pretraining in new downstream tasks, it is worse than any ImageNet-1k pretrained ViT. Don't count on DINO if you have your own labeled images; go to a supervised pretrained ViT without hesitation. The DINO feature map is not a good feature if you want to do fine-grained tasks like fine-grained classification. By the way, all ViTs are sensitive to hyperparameters: tune the learning rate, weight decay, batch size, and dropout rate one by one; it is almost a grid search.

The best practice for DINO would be: if you have a large amount of unlabeled images for your domain task and find that the original weights cannot do decent zero-shot classification, do a first-stage fine-tuning in DINO's original unsupervised way to get better features, and then do a second-stage fine-tuning with a few labeled images.

This holds if your goal is classification: the distillation pretraining will not help no matter how much computation you devote, because its direction is biased away from the target task.

If you have a good labeled dataset, just go with any ImageNet-1k model. Swin or Vision-RWKV is a good choice. If you insist on using a plain ViT, CLIP is better than DINO anyway.

PMRS-lab commented 2 months ago

We did some analysis and found that DINOv2 behaves more like a binary-mask semantic representation model than the semantic analyzer that most pretrained models act as. The distilled pretraining method pays for being label-free with strongly biased modeling on common vision tasks. It performs well in the zero-shot setting only because it has seen far more data than the ImageNet-1k models, but that is as far as it goes. For objects unseen during pretraining in new downstream tasks, it is worse than any ImageNet-1k pretrained ViT. Don't count on DINO if you have your own labeled images; go to a supervised pretrained ViT without hesitation. The DINO feature map is not a good feature if you want to do fine-grained tasks like fine-grained classification. By the way, all ViTs are sensitive to hyperparameters: tune the learning rate, weight decay, batch size, and dropout rate one by one; it is almost a grid search.

The best practice for DINO would be: if you have a large amount of unlabeled images for your domain task and find that the original weights cannot do decent zero-shot classification, do a first-stage fine-tuning in DINO's original unsupervised way to get better features, and then do a second-stage fine-tuning with a few labeled images.

This holds if your goal is classification: the distillation pretraining will not help no matter how much computation you devote, because its direction is biased away from the target task.

If you have a good labeled dataset, just go with any ImageNet-1k model. Swin or Vision-RWKV is a good choice. If you insist on using a plain ViT, CLIP is better than DINO anyway.

Hello, I completely agree with what you said, but I'm not quite sure what you mean: are you saying that using supervised training with distillation in other ViT models can achieve better results than DINOv2? ViT model training requires a large amount of memory, so do you mean using supervised training with models like the Swin Transformer to achieve better feature extraction? Thank you very much!