eypros opened this issue 3 days ago
Hi!

I am interested in utilizing your work for a project, so the actual use case involves training on a custom dataset and then using the model to infer masks afterwards. I couldn't find any direct information regarding inference. Am I supposed to follow the original DINOv2 approach for inference, or should I deduce it from the evaluation code you provide (inside the training pipeline, that is)? Can you provide a minimal, functional example for inference?

Reply from the repository author:

When you ask about inference, do you mean using the fine-tuned decoders (+ LoRA) on Pascal VOC and ADE20K with the ViT-L DINOv2 weights? explanation.ipynb in the root of the project has some examples of how to use this combination. For example:
import torch

from dino_finetune import DINOV2EncoderLoRA

# Load the pretrained DINOv2 ViT-L/14 backbone (with registers) from torch hub
encoder = torch.hub.load(
    repo_or_dir="facebookresearch/dinov2",
    model="dinov2_vitl14_reg",
).cuda()

dino_lora = DINOV2EncoderLoRA(
    encoder=encoder,
    r=3,                 # The same LoRA rank used in training
    emb_dim=1024,        # The ViT-L embedding dimension
    img_dim=(308, 308),  # For ease of use, rescale to a valid patch dimension
    n_classes=21,        # Number of classes in Pascal VOC
    use_lora=True,
).cuda()

# Load the fine-tuned LoRA + decoder weights and switch to evaluation mode
dino_lora.load_parameters("output/base_voc_lora.pt")
dino_lora.eval()

# Forward pass on a (batch, channels, height, width) tensor; argmax over
# the class dimension gives a per-pixel segmentation mask
logits = dino_lora(torch.randn(1, 3, 308, 308).cuda().float())
y_hat = torch.argmax(torch.sigmoid(logits), dim=1)
y_hat.shape  # (1, 308, 308)
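To run this on a real image rather than random noise, here is a minimal sketch. The preprocessing (resizing to 308x308 plus ImageNet mean/std normalization) and the file name "example.jpg" are assumptions for illustration; check the data loading code in the repository for the exact transforms used during training.

import torch
import torchvision.transforms as T
from PIL import Image

# Assumed preprocessing: resize to the img_dim used above and apply
# standard ImageNet normalization (verify against the training pipeline)
preprocess = T.Compose([
    T.Resize((308, 308)),
    T.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=(0.485, 0.456, 0.406),
                std=(0.229, 0.224, 0.225)),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical input file
x = preprocess(image).unsqueeze(0).cuda()         # add a batch dimension

with torch.no_grad():            # no gradients needed at inference time
    logits = dino_lora(x)        # (1, n_classes, 308, 308)
    mask = logits.argmax(dim=1)  # (1, 308, 308) per-pixel class indices

mask_np = mask.squeeze(0).cpu().numpy()  # segmentation mask as a numpy array

For a custom dataset, the same pattern should apply: presumably you would train with your own n_classes, then construct DINOV2EncoderLoRA with that value and point load_parameters at your own checkpoint.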