Closed Claire874 closed 1 year ago
Tagging @abhi-glitchhg who is better suited to help you with this.
Hey @Claire874 , glad you find this repo useful.
Is the value of the num_frame fixed or not?
Yes, the value is fixed for a particular model. Passing a video with a different num_frames will result in dimensionality errors, because we add the positional embeddings to the token embeddings, and the dimensions of the positional embeddings depend on num_frames.
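To make the shape dependency concrete, here is a minimal framework-free sketch. It is not vformer code: `num_tokens`, `add_positional_embeddings`, the tubelet size (2, 16, 16), and the 224x224 resolution are all assumptions chosen for illustration.

```python
def num_tokens(num_frames, height, width, tubelet=(2, 16, 16)):
    """Token count for non-overlapping tubelets of size (t, h, w) (illustrative sizes)."""
    t, h, w = tubelet
    return (num_frames // t) * (height // h) * (width // w)


def add_positional_embeddings(tokens, pos_embed):
    """Element-wise add; raises when the two sequences differ in length."""
    if len(tokens) != len(pos_embed):
        raise ValueError(
            f"got {len(tokens)} tokens but {len(pos_embed)} positional embeddings"
        )
    return [[a + b for a, b in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]


embed_dim = 4
# The positional table is sized once, at model construction, for 16 frames.
pos_embed = [[0.0] * embed_dim for _ in range(num_tokens(16, 224, 224))]

ok_tokens = [[1.0] * embed_dim for _ in range(num_tokens(16, 224, 224))]
bad_tokens = [[1.0] * embed_dim for _ in range(num_tokens(32, 224, 224))]

summed = add_positional_embeddings(ok_tokens, pos_embed)  # same length, works
try:
    add_positional_embeddings(bad_tokens, pos_embed)  # 32-frame clip, wrong length
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

A tensor library would report the same problem as a shape or broadcasting error instead of the explicit `ValueError` used here.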
Does the model process each frame one by one?
No, the ViViT model doesn't process the frames one by one. There are two ways of converting a video into embeddings:
1) Uniform frame sampling
2) Tubelet embedding
For a single forward pass, both methods consider videos of length num_frames (or a video of length T in the above diagrams).
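As a rough illustration of how a whole clip becomes one token sequence, the sketch below cuts a toy clip into per-frame patches and into tubelets that span several frames. The helper names `frame_patches` and `tubelets` are hypothetical, the clip is a nested list rather than a tensor, and sizes are assumed to divide evenly; in the real model each flattened patch would then be linearly projected to the embedding dimension.

```python
def frame_patches(video, ph, pw):
    """Cut every frame into ph x pw patches and concatenate the patches
    of all frames into a single token sequence."""
    tokens = []
    for frame in video:
        for i in range(0, len(frame), ph):
            for j in range(0, len(frame[0]), pw):
                tokens.append([frame[i + di][j + dj]
                               for di in range(ph) for dj in range(pw)])
    return tokens


def tubelets(video, pt, ph, pw):
    """Cut the clip into pt x ph x pw volumes, each spanning pt frames."""
    tokens = []
    for k in range(0, len(video), pt):
        for i in range(0, len(video[0]), ph):
            for j in range(0, len(video[0][0]), pw):
                tokens.append([video[k + dk][i + di][j + dj]
                               for dk in range(pt)
                               for di in range(ph)
                               for dj in range(pw)])
    return tokens


# Toy clip: T=4 frames of 4x4 single-channel pixels, value = t*100 + y*10 + x.
T, H, W = 4, 4, 4
video = [[[t * 100 + y * 10 + x for x in range(W)] for y in range(H)]
         for t in range(T)]

patch_tokens = frame_patches(video, 2, 2)  # T * (H/2) * (W/2) tokens of 4 values
tube_tokens = tubelets(video, 2, 2, 2)     # (T/2) * (H/2) * (W/2) tokens of 8 values
```

In both cases the encoder receives the tokens of all T frames in one sequence, which is why the clip length has to match what the model was built for.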
I hope I've answered your question. Let me know. Thanks!
In the paper, the authors ran ablation studies that vary the number of input frames. You might want to go through those as well.
Thanks for the detailed response, which is very helpful to me.
For inputs with different sequence lengths, is it OK to pad them and mask the padding? Is there any code available for this in vformer? 👍
Zero padding or video interpolation could solve your problem. Or, instead of considering all frames, consider views as mentioned in the ablation studies. This way you would have control over the number of frames/views, and it would reduce the memory footprint. Sadly, we don't have code for these operations in vformer. Scenic, the official JAX implementation, might have code for them in its data/ directory. (Sorry for redirecting you to a JAX codebase :) )
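For anyone landing here later, a minimal sketch of the idea, written in plain Python so it does not depend on any framework. It is neither vformer nor Scenic code: `fit_to_num_frames` is a hypothetical helper that subsamples long clips at evenly spaced indices and zero-pads short ones, returning a mask that marks the real frames.

```python
def fit_to_num_frames(frames, num_frames, pad_frame):
    """Return (clip, mask), each with exactly num_frames entries.

    Longer clips are subsampled at evenly spaced indices; shorter clips are
    padded at the end with pad_frame. mask[i] is True for real frames and
    False for padding.
    """
    n = len(frames)
    if n >= num_frames:
        idx = [i * n // num_frames for i in range(num_frames)]
        return [frames[i] for i in idx], [True] * num_frames
    pad = num_frames - n
    return list(frames) + [pad_frame] * pad, [True] * n + [False] * pad


# Frames are stand-in labels here; in practice each would be an image array
# and pad_frame an all-zero image of the same shape.
long_clip, long_mask = fit_to_num_frames(list(range(10)), 4, pad_frame=None)
short_clip, short_mask = fit_to_num_frames(["a", "b", "c"], 4, pad_frame="zeros")
```

The mask only helps if the attention layers actually use it to ignore padded positions; without that, the padded frames are treated as real input.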
@Claire874 I'll close this issue now. If you have any further questions, please feel free to re-open or create another issue.
Thanks for the great work on ViViT model 2. Is the value of the num_frame fixed or not? Or does the model process each frame one by one?