SforAiDl / vformer

A modular PyTorch library for vision transformer models
https://vformer.readthedocs.io/
MIT License
162 stars 22 forks

Is the value of the num_frame in Video Transformer fixed or not #102

Closed Claire874 closed 1 year ago

Claire874 commented 1 year ago

Thanks for the great work on the ViViT Model 2. Is the value of num_frames fixed or not? Or does the model process each frame one by one?

NeelayS commented 1 year ago

Tagging @abhi-glitchhg who is better suited to help you with this.

abhi-glitchhg commented 1 year ago

Hey @Claire874 , glad you find this repo useful.

Is the value of the num_frame fixed or not

Yes, the value for a particular model is fixed. Passing in a video with a different num_frames will result in dimensionality errors, because we add the positional embeddings to the token embeddings, and the dimensions of the positional embeddings depend on num_frames.

https://github.com/SforAiDl/vformer/blob/b0370f9bda23d9da9fee5c14eaa5137eed936e0d/vformer/models/classification/vivit.py#L83-L86

https://github.com/SforAiDl/vformer/blob/b0370f9bda23d9da9fee5c14eaa5137eed936e0d/vformer/models/classification/vivit.py#L133
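To make the dimensionality issue concrete, here is a minimal sketch (illustrative shapes and names, not vformer's actual code) of why num_frames is baked in at construction time: the positional-embedding table is sized once, so a clip with a different frame count produces a token sequence that no longer lines up with it.

```python
import torch

# Hypothetical sizes: 16 frames, 196 patches per frame, embedding dim 192.
num_frames, patches_per_frame, embed_dim = 16, 196, 192
num_tokens = num_frames * patches_per_frame + 1  # +1 for the CLS token
pos_embedding = torch.zeros(1, num_tokens, embed_dim)  # fixed at build time

def add_pos(tokens):
    # tokens: (batch, num_tokens, embed_dim) -- must match the table exactly
    return tokens + pos_embedding

ok = add_pos(torch.zeros(2, num_tokens, embed_dim))  # shapes match: fine

try:
    # A clip with only 8 frames yields fewer tokens -> shape mismatch.
    add_pos(torch.zeros(2, 8 * patches_per_frame + 1, embed_dim))
    shape_error = False
except RuntimeError:
    shape_error = True
```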

does the model process each frame one by one

No, the ViViT model doesn't process the frames one by one. There are two ways of converting a video into embeddings:

1) Uniform frame sampling

2) Tubelet embedding

For a single forward pass, both methods consider videos of length num_frames (a video of length T in the above diagrams).
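The two schemes can be sketched as follows. This is an illustrative toy implementation under assumed sizes (patch size 16, tubelet depth 2, embedding dim 192), not vformer's actual code: uniform frame sampling embeds each frame's 2-D patches independently and concatenates the tokens across time, while tubelet embedding uses a 3-D convolution to extract spatio-temporal tubes in one shot.

```python
import torch
import torch.nn as nn

B, C, T, H, W = 2, 3, 16, 224, 224      # batch, channels, frames, height, width
video = torch.randn(B, C, T, H, W)
embed_dim, patch = 192, 16

# 1) Uniform frame sampling: a 2-D patch embedding applied per frame,
#    then tokens from all frames are concatenated along the sequence axis.
frame_embed = nn.Conv2d(C, embed_dim, kernel_size=patch, stride=patch)
frames = video.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
tok_uniform = frame_embed(frames).flatten(2).transpose(1, 2)   # (B*T, 196, dim)
tok_uniform = tok_uniform.reshape(B, T * (H // patch) * (W // patch), embed_dim)

# 2) Tubelet embedding: a 3-D convolution extracts tubes spanning
#    `tube` frames, halving the token count along time here.
tube = 2
tube_embed = nn.Conv3d(C, embed_dim, kernel_size=(tube, patch, patch),
                       stride=(tube, patch, patch))
tok_tube = tube_embed(video).flatten(2).transpose(1, 2)        # (B, (T//2)*196, dim)
```

Note that both produce one token sequence for the whole clip, which is why the full num_frames must be present for a single forward pass.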

I hope I've answered your question. Let me know. Thanks!

abhi-glitchhg commented 1 year ago

In the paper, the authors have done ablation studies varying the number of input frames. You might want to go through those as well.


Claire874 commented 1 year ago

Thanks for the detailed response, which is very helpful to me.

Claire874 commented 1 year ago

For different sequence lengths, is it OK to pad them and mask the padding? Is there any available code in vformer? 👍

abhi-glitchhg commented 1 year ago

Zero padding or video interpolation could solve your problem. Alternatively, instead of considering all frames, consider views as mentioned in the ablation studies; that way you would have control over the number of frames/views, and it would reduce the memory footprint. Sadly, we don't have code for these operations in vformer. Scenic, the official JAX implementation, might have code for them in its data/ directory. (Sorry for redirecting you to a JAX codebase :) )
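The first two workarounds can be sketched in a few lines of plain PyTorch; this is an illustrative example with assumed shapes, not a vformer or Scenic API:

```python
import torch
import torch.nn.functional as F

target_frames = 16
clip = torch.randn(2, 3, 10, 224, 224)  # (batch, channels, frames, H, W)

# Option 1: zero-pad the temporal axis up to the model's num_frames,
# keeping a mask that marks which frames are real.
pad = target_frames - clip.shape[2]
# F.pad takes pairs starting from the last dim: (W, W, H, H, T, T)
padded = F.pad(clip, (0, 0, 0, 0, 0, pad))
mask = torch.arange(target_frames) < clip.shape[2]  # True for real frames

# Option 2: resample (interpolate) the clip to exactly num_frames.
resampled = F.interpolate(clip, size=(target_frames, 224, 224),
                          mode="trilinear", align_corners=False)
```

Note that with plain zero padding, the attention layers will still attend to the padded frames unless the model accepts an attention mask, which is one reason the interpolation or fixed-view approaches can be simpler in practice.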

NeelayS commented 1 year ago

@Claire874 I'll close this issue now. If you have any further questions, please feel free to re-open or create another issue.