AssafSinger94 / dino-tracker

Official Pytorch Implementation for “DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video”
MIT License
361 stars 39 forks

model weights ? #9

Closed javiabellan closed 2 months ago

javiabellan commented 3 months ago

I would like to locate the weights:

frozen_dino

I can see on this line and this line that the frozen DINO embeddings come from `./dataset/libby/dino_embeddings/dino_embed_video.pt`. However, I cannot locate the libby dataset (I only searched the Dropbox).

delta_dino & tracker_head

I can see that the delta_dino and tracker_head weights are on Dropbox. However, there is a different .pt file for each video. I've compared those files against each other (with the diff and cmp CLI tools), and the weights differ between videos. Where can I find the best/final weights, if there are any?

[Screenshot 2024-05-22, 13:26:57]

tnarek commented 3 months ago

Hi @javiabellan, thanks for your questions.

  1. As an example video, we included the "horsejump" video from DAVIS, but the code still references "libby". Sorry for the inconvenience. You can change the path to /dataset/horsejump/dino_embeddings/dino_embed_video.pt and it should work. If you want the libby video, it is included in the Dropbox DAVIS zip file under index 25.
  2. DINO-Tracker is an optimization-based method and is therefore trained on a specific video, i.e., each video has its own trained weights. So if you want to test on new videos, you need to train on them.
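Since every video has its own trained checkpoint, the weights have to be loaded per video. A minimal sketch of what that could look like, assuming one `delta_dino.pt` and one `tracker_head.pt` per video folder as seen in the Dropbox listing (the file names and layout here are illustrative assumptions, not the repo's actual loading code):

```python
import torch

def load_video_checkpoint(video_dir):
    """Load the per-video trained weights for one DINO-Tracker run.

    Assumes the video folder holds two checkpoint files, as suggested by
    the Dropbox layout: delta_dino.pt and tracker_head.pt (names are
    assumptions for illustration).
    """
    delta_dino = torch.load(f"{video_dir}/delta_dino.pt", map_location="cpu")
    tracker_head = torch.load(f"{video_dir}/tracker_head.pt", map_location="cpu")
    return delta_dino, tracker_head
```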
javiabellan commented 3 months ago

Thanks for your response,

1)

Sorry, I still don't know where /dataset/horsejump/ is. All I can see on Dropbox is a horse-jumping video located at BADJA/5.

Because you are using a frozen DINOv2, maybe it's easier for me to obtain it from torch hub (all I need to know is which one you are using):

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')

2)

I didn't know that DINO-Tracker was trained for a specific video. Now all the different weights make sense.

Idea: Wouldn't it be nice to have a single set of DINO-Tracker weights that generalizes to any unseen video, given the input points in the first frame (manually picked or a uniform grid)? Is this currently possible with DINO-Tracker, or would it require additional training?

tnarek commented 3 months ago
  1. dataset/horsejump only contains the video and the foreground masks. You need to run the preprocessing script to extract its embeddings (or the embeddings of any video). This script loads the DINOv2 weights from torch hub and saves the DINOv2 embeddings of all the video frames.
  2. This is definitely an interesting research direction. As a simple extension, one can apply the same preprocessing (optical-flow and DINO best-buddies) and self-supervised losses to train DINO-Tracker on a dataset of unlabeled videos.
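The preprocessing described in point 1 can be sketched roughly like this: run a DINOv2 model over every frame and save the stacked patch embeddings as a single `.pt` file. This is an illustrative sketch, not the repo's actual script; the model variant, input resolution, and output path are assumptions. (The `forward_features` call and its `x_norm_patchtokens` key are from the DINOv2 torch hub models.)

```python
import torch

def extract_video_embeddings(model, frames, out_path="dino_embed_video.pt"):
    """Extract per-frame DINOv2 patch embeddings and save them to one .pt file.

    model: a DINOv2 torch hub model (e.g. dinov2_vits14).
    frames: float tensor of shape (T, 3, H, W), with H and W divisible
    by 14 (DINOv2's patch size). Returns a tensor (T, N_patches, C).
    """
    model.eval()
    embeddings = []
    with torch.no_grad():
        for frame in frames:
            # DINOv2 models expose forward_features(); patch tokens live
            # under the "x_norm_patchtokens" key, shape (1, N_patches, C).
            feats = model.forward_features(frame.unsqueeze(0))["x_norm_patchtokens"]
            embeddings.append(feats.squeeze(0))
    video_embed = torch.stack(embeddings)
    torch.save(video_embed, out_path)
    return video_embed

# Usage (downloads weights on first call; frame loading is up to you):
# dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
# extract_video_embeddings(dinov2, frames,
#                          'dataset/horsejump/dino_embeddings/dino_embed_video.pt')
```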
javiabellan commented 3 months ago

In the end, this work is very similar to DODUO. They train a generic matcher, but at low resolution (super-pixel). What I really like about DINO-Tracker is the sub-pixel resolution of the feature correspondences (matches). If I understood correctly, in DINO-Tracker you can create a very dense grid of query points, and it produces the matches (even for frames far apart in time).
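The "very dense grid of query points" idea is easy to sketch: build an (N, 3) tensor of (frame_index, x, y) queries covering the first frame. The (t, x, y) query layout is a common convention among point trackers; DINO-Tracker's exact input format may differ, so treat this as an assumption.

```python
import torch

def uniform_query_grid(height, width, stride=16, frame_idx=0):
    """Build a uniform grid of (frame_index, x, y) query points.

    Returns a float tensor of shape (N, 3); stride controls the density
    (smaller stride = denser grid).
    """
    ys = torch.arange(stride // 2, height, stride)
    xs = torch.arange(stride // 2, width, stride)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    t = torch.full_like(grid_x, frame_idx)
    return torch.stack([t, grid_x, grid_y], dim=-1).reshape(-1, 3).float()
```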

tnarek commented 3 months ago

Thanks for pointing out this work! It does seem relevant. However, some key differences are:

  1. DINO-Tracker is trained on a single video, while DODUO is trained on a dataset.
  2. DINO-Tracker's approach is to refine DINO features in a lightweight manner, while DODUO appears to train its feature extractor from scratch and uses DINO feature similarities only when estimating the flow field.

And you are correct: DINO-Tracker can track points densely across all video frames.