In the paper, you mentioned that "Due to GPU memory constraints, we train our model by randomly sampling 5 frames per video sequence."
From my understanding, though, the number of frames should not be much of a bottleneck with the current architecture and memory design.
For example, when the 6th frame is given, the previous hidden states should already be fixed (having been updated), since they do not depend on future image frames.
Technically speaking, then, I think we could extend the sequence length indefinitely during training.
Is my understanding correct?
Hi @resurgo97, during training you need gradient backpropagation across frames, and the number of memory tokens grows over time, so you cannot extend the sequence length indefinitely.
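To illustrate the point in the reply, here is a minimal PyTorch sketch (using a generic `nn.GRUCell` as a stand-in for the model's memory update; this is not the paper's actual training code). In the first loop, every frame's activations remain in the autograd graph, so activation memory grows with the number of frames. In the second loop, detaching the hidden state (the assumption in the question) keeps memory bounded, but then gradients no longer flow back through earlier frames, which changes what the model learns.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model's recurrent memory update.
cell = nn.GRUCell(input_size=8, hidden_size=8)
frames = [torch.randn(1, 8) for _ in range(6)]

# Full BPTT: the autograd graph spans all 6 frames, so activation
# memory grows linearly with sequence length.
h = torch.zeros(1, 8)
for x in frames:
    h = cell(x, h)
h.sum().backward()
full_grad = cell.weight_hh.grad.clone()  # gradient flows through every frame

# Truncated variant: detach the hidden state before each step, so earlier
# activations can be freed. Memory stays roughly constant, but backward()
# from the final state now only covers the last frame's computation.
cell.zero_grad()
h = torch.zeros(1, 8)
for x in frames:
    h = cell(x, h.detach())
h.sum().backward()
truncated_grad = cell.weight_hh.grad.clone()
```

So "fixed" hidden states at inference time do not carry over to training: with full backpropagation the graph (and memory) grows with the sequence, and cutting the graph to save memory also cuts the gradient signal.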