eypros opened this issue 3 days ago
Hi!

I am interested in utilizing your work for a project, so the actual use case involves training on a custom dataset and then using the model to infer masks afterwards. I couldn't find any direct information regarding inference. Am I supposed to follow the original DINOv2 approach for inference, or should I deduce it from the evaluation code you provide (inside the training pipeline, that is)? Can you provide a minimal, functional example for inference?

Reply from the repository author:

When you ask about inference, do you mean using the fine-tuned decoders (+ LoRA) on Pascal VOC and ADE20K with the ViT-L DINOv2 weights? explanation.ipynb in the root of the project has some examples of how to use this combination. For example:
import torch

from dino_finetune import DINOV2EncoderLoRA

# Load the pretrained DINOv2 ViT-L/14 backbone (with registers) from torch hub
encoder = torch.hub.load(
    repo_or_dir="facebookresearch/dinov2",
    model="dinov2_vitl14_reg",
).cuda()

dino_lora = DINOV2EncoderLoRA(
    encoder=encoder,
    r=3,                 # The same LoRA rank used in training
    emb_dim=1024,        # The ViT-L embedding dimension
    img_dim=(308, 308),  # For ease of use, rescale to a valid patch dimension
    n_classes=21,        # Number of classes in Pascal VOC
    use_lora=True,
).cuda()

# Load the fine-tuned LoRA + decoder weights and switch to evaluation mode
dino_lora.load_parameters("output/base_voc_lora.pt")
dino_lora.eval()

# Forward pass on a (batch, channels, height, width) tensor; argmax over
# the class dimension gives a per-pixel segmentation mask
logits = dino_lora(torch.randn(1, 3, 308, 308).cuda().float())
y_hat = torch.argmax(torch.sigmoid(logits), dim=1)
y_hat.shape  # (1, 308, 308)
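To run this on a real image rather than random noise, here is a minimal sketch. The preprocessing (resizing to 308x308 plus ImageNet mean/std normalization) and the file name "example.jpg" are assumptions for illustration; check the data loading code in the repository for the exact transforms used during training.

import torch
import torchvision.transforms as T
from PIL import Image

# Assumed preprocessing: resize to the img_dim used above and apply
# standard ImageNet normalization (verify against the training pipeline)
preprocess = T.Compose([
    T.Resize((308, 308)),
    T.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=(0.485, 0.456, 0.406),
                std=(0.229, 0.224, 0.225)),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical input file
x = preprocess(image).unsqueeze(0).cuda()         # add a batch dimension

with torch.no_grad():            # no gradients needed at inference time
    logits = dino_lora(x)        # (1, n_classes, 308, 308)
    mask = logits.argmax(dim=1)  # (1, 308, 308) per-pixel class indices

mask_np = mask.squeeze(0).cpu().numpy()  # segmentation mask as a numpy array

For a custom dataset, the same pattern should apply: presumably you would train with your own n_classes, then construct DINOV2EncoderLoRA with that value and point load_parameters at your own checkpoint.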