How can DINO v2 be utilized for downstream tasks involving video data?

seekerhuang commented 1 year ago

Truly nice work! Video data often requires incorporating the temporal dimension into the network architecture to account for its sequential frames. However, DINO v2 currently has no provision for handling the temporal dimension. Creating a custom version of DINO v2 with temporal processing is challenging because its state dictionary no longer matches the pretrained one, which makes the pretrained weights difficult to load.

What approach should be adopted to enable DINO v2 to have temporal processing capabilities while still being able to load the pretrained weights?

Thank you very much for your response.

MohammedSB commented 1 year ago

There are multiple techniques for doing this; you can take a look at this: https://collab.dvb.bayern/display/TUMdlma/Converting+weights+of+2D+Vision+Transformer+for+3D+Image+Classification Assuming you want to incorporate temporal information in the model, and not process each frame independently, you can try the weight inflation technique mentioned in the article.
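To make the weight inflation idea concrete, here is a minimal sketch, not an official DINO v2 recipe. It assumes the only layer whose shape changes is the 2D patch-embedding convolution, with a kernel of shape `(D, C, H, W)`, and that it is replaced by a 3D convolution spanning `T` frames. The function name `inflate_patch_embed` and the toy sizes are hypothetical. It is written in NumPy to stay framework-agnostic. The 2D kernel is copied `T` times along a new time axis and divided by `T`, so a clip made of `T` identical frames produces the same patch embedding as the original 2D model does on a single frame.

```python
import numpy as np


def inflate_patch_embed(w2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a 2D conv kernel (D, C, H, W) to a 3D kernel (D, C, t, H, W).

    The kernel is replicated t times along a new temporal axis and scaled
    by 1/t, so that summing over time recovers the original 2D kernel.
    """
    if w2d.ndim != 4:
        raise ValueError(f"expected a 4D kernel, got shape {w2d.shape}")
    if t < 1:
        raise ValueError("t must be a positive integer")
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t


# Toy sizes (assumptions for illustration): embedding dim 8, RGB input,
# 14x14 patches, temporal kernel size 2.
rng = np.random.default_rng(0)
D, C, P, T = 8, 3, 14, 2

w2d = rng.standard_normal((D, C, P, P))  # stands in for the pretrained kernel
w3d = inflate_patch_embed(w2d, T)

# One image patch, and a "static clip" that repeats it T times along time.
patch = rng.standard_normal((C, P, P))
clip = np.repeat(patch[:, None, :, :], T, axis=1)  # shape (C, T, P, P)

# A convolution evaluated at a single location is a dot product with the kernel.
out2d = np.einsum("dchw,chw->d", w2d, patch)
out3d = np.einsum("dcthw,cthw->d", w3d, clip)

# The inflated kernel reproduces the 2D embedding on a static clip.
print(np.allclose(out2d, out3d))
```

In PyTorch the same inflation would be `w2d.unsqueeze(2).repeat(1, 1, T, 1, 1) / T`. Assuming the patch-embedding weight is stored under a key such as `patch_embed.proj.weight` (check your checkpoint), you would overwrite that entry in the pretrained state dict before calling `load_state_dict`, leaving the transformer blocks unchanged. Position embeddings also need to cover the extra temporal tokens, which this sketch does not address.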

ander008 commented 2 months ago

Hello! Could you address this problem? I encountered the same confusion.

facebookresearch / dinov2

How can DINO v2 be utilized for downstream tasks involving video data? #229