Open seekerhuang opened 1 year ago
There are several techniques for this; you can take a look at this article: https://collab.dvb.bayern/display/TUMdlma/Converting+weights+of+2D+Vision+Transformer+for+3D+Image+Classification Assuming you want to incorporate temporal information in the model rather than process each frame independently, you can try the weight inflation technique mentioned in the article.
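To make the weight inflation idea a bit more concrete, here is a minimal sketch in PyTorch. It is not DINOv2's code: the layer sizes, the temporal kernel size `t`, and the helper name `inflate_patch_embed_weight` are all assumptions for illustration. The idea is to copy the pretrained 2D patch-embedding kernel along a new temporal axis and divide by `t`, so that a clip made of identical frames produces the same tokens as the original 2D layer.

```python
import torch
import torch.nn as nn


def inflate_patch_embed_weight(weight_2d: torch.Tensor, temporal_size: int) -> torch.Tensor:
    """Inflate a 2D conv kernel (out, in, kh, kw) to 3D (out, in, t, kh, kw).

    Hypothetical helper: the kernel is repeated along the new temporal axis
    and divided by `temporal_size`, so the summed response over t identical
    frames equals the response of the original 2D kernel.
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1)
    return weight_3d / temporal_size


# Illustrative sizes only (assumptions, not DINOv2's real configuration).
embed_dim, patch, t = 8, 14, 2

# Stand-in for a pretrained 2D patch embedding.
conv2d = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
# Custom 3D patch embedding that also spans t frames.
conv3d = nn.Conv3d(3, embed_dim, kernel_size=(t, patch, patch), stride=(t, patch, patch))

with torch.no_grad():
    conv3d.weight.copy_(inflate_patch_embed_weight(conv2d.weight, t))
    conv3d.bias.copy_(conv2d.bias)  # the bias has no temporal axis, copy as-is

# Sanity check: a clip of t identical frames should match the 2D output.
image = torch.randn(1, 3, 28, 28)
clip = image.unsqueeze(2).repeat(1, 1, t, 1, 1)  # (1, 3, t, 28, 28)
out2d = conv2d(image)                            # (1, embed_dim, 2, 2)
out3d = conv3d(clip).squeeze(2)                  # temporal dim is 1 after the conv
```

With this approach only the patch-embedding weight changes shape. The remaining pretrained weights can usually be loaded into the custom model with `load_state_dict(..., strict=False)` after the inflated tensor has been swapped into the state dict, although positional embeddings may also need to be adapted depending on how many tokens the video model produces.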
Hello! Could you address this problem? I encountered the same confusion.
Truly nice work! Video data often requires incorporating the temporal dimension into the network architecture to account for its sequential frames. However, DINO v2 currently has no provision for handling the temporal dimension. Creating a custom version of DINO v2 with temporal processing is challenging because its state dictionary no longer matches the pretrained one, making it difficult to load the pretrained model.
What approach should be adopted to enable DINO v2 to have temporal processing capabilities while still being able to load the pretrained weights?
Thank you very much for your response.