facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0
8.78k stars 761 forks

How can DINO v2 be utilized for downstream tasks involving video data? #229

Open seekerhuang opened 11 months ago

seekerhuang commented 11 months ago

Truly nice work! Video data usually requires a network that models the temporal dimension across sequential frames. However, DINO v2 currently has no mechanism for handling the temporal dimension. Building a custom version of DINO v2 with temporal processing is challenging because its state dictionary no longer matches the pretrained one, which makes the pretrained weights difficult to load.

What approach should be adopted to enable DINO v2 to have temporal processing capabilities while still being able to load the pretrained weights?

Thank you very much for your response.

MohammedSB commented 11 months ago

There are several techniques for this. You can take a look at this article: https://collab.dvb.bayern/display/TUMdlma/Converting+weights+of+2D+Vision+Transformer+for+3D+Image+Classification

Assuming you want to incorporate temporal information into the model rather than process each frame independently, you can try the weight inflation technique mentioned in the article.
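As a rough illustration of weight inflation, here is a minimal sketch that turns a 2D patch-embedding convolution into a 3D one by copying the kernel along a new temporal axis and dividing by the number of frames. The function name `inflate_conv2d_weight` and the toy sizes are hypothetical, and the layers are stand-ins rather than the actual DINOv2 modules. It assumes the patch embedding is a plain `nn.Conv2d`, so treat it as a starting point, not as the method from the article.

```python
import torch
import torch.nn as nn


def inflate_conv2d_weight(weight_2d: torch.Tensor, temporal_size: int) -> torch.Tensor:
    """Inflate a Conv2d kernel (out, in, kh, kw) to a Conv3d kernel (out, in, t, kh, kw).

    The 2D kernel is copied along a new temporal axis and divided by
    temporal_size, so a clip of identical frames produces the same
    response as the original 2D convolution on a single frame.
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1)
    return weight_3d / temporal_size


torch.manual_seed(0)
embed_dim, patch, frames = 8, 14, 2  # toy sizes chosen for illustration only

# Stand-in for a pretrained 2D patch embedding (assumption: a plain Conv2d).
conv2d = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
# Temporal counterpart that consumes `frames` frames per token.
conv3d = nn.Conv3d(
    3, embed_dim,
    kernel_size=(frames, patch, patch),
    stride=(frames, patch, patch),
)

with torch.no_grad():
    inflated = inflate_conv2d_weight(conv2d.weight, frames)
    conv3d.weight.copy_(inflated)
    conv3d.bias.copy_(conv2d.bias)

    # Sanity check: a clip made of one repeated frame should match the 2D output.
    frame = torch.randn(1, 3, 2 * patch, 2 * patch)
    clip = frame.unsqueeze(2).repeat(1, 1, frames, 1, 1)  # (B, C, T, H, W)
    out_2d = conv2d(frame)
    out_3d = conv3d(clip).squeeze(2)

print(torch.allclose(out_2d, out_3d, atol=1e-4))
```

Averaging keeps the pretrained behaviour on static input. An alternative is to put the 2D kernel only in the central temporal slice and zero the rest. The other transformer weights do not depend on the input dimensionality, although positional embeddings may need separate handling if the number of tokens changes.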

ander008 commented 2 days ago

Hello! Could you address this problem? I encountered the same confusion.
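For the state-dictionary mismatch raised in the original question, one possible approach is to load only the pretrained tensors whose names and shapes still match the modified model, and to handle the rest (for example by inflation) separately. The sketch below uses toy modules and a hypothetical helper, `load_matching_weights`. The only PyTorch calls it relies on are `state_dict()` and `load_state_dict(..., strict=False)`. Note that `strict=False` tolerates missing keys but still raises on shape mismatches, which is why the tensors are filtered first.

```python
import torch
import torch.nn as nn


def load_matching_weights(model: nn.Module, pretrained_state: dict) -> list:
    """Load tensors whose name and shape match; return the names that were skipped."""
    own_state = model.state_dict()
    compatible, skipped = {}, []
    for name, tensor in pretrained_state.items():
        if name in own_state and own_state[name].shape == tensor.shape:
            compatible[name] = tensor
        else:
            skipped.append(name)
    # strict=False allows keys of the model that are absent from `compatible`.
    model.load_state_dict(compatible, strict=False)
    return skipped


torch.manual_seed(0)

# Toy "pretrained" 2D model and a temporal variant (illustrative stand-ins only).
source = nn.Sequential(nn.Conv2d(3, 8, kernel_size=14, stride=14), nn.Linear(8, 8))
target = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=(2, 14, 14), stride=(2, 14, 14)),
    nn.Linear(8, 8),
)

skipped = load_matching_weights(target, source.state_dict())
print(skipped)  # names that need separate handling, such as the changed conv kernel
```

The skipped names show exactly which layers changed shape, so only those need a conversion step before training on video.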