In the paper, you mentioned that "Due to GPU memory constraints, we train our model by randomly sampling 5 frames per video sequence."
From my understanding, though, the number of frames should not be much of a bottleneck with the current architecture and memory design.
For example, when the 6th frame is given, the previous hidden states should already be fixed (having been updated), since they do not depend on future image frames.
Technically speaking, then, I think we could extend the sequence length indefinitely during training.
Is my understanding correct?
Hi @resurgo97, during training you need gradient backpropagation across frames, and the number of memory tokens grows over time, so you cannot extend the sequence length indefinitely.
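To illustrate the point in the reply, here is a minimal PyTorch sketch (using a generic `nn.GRUCell` as a stand-in for the model's memory update; this is not the paper's actual training code). In the first loop, every frame's activations remain in the autograd graph, so activation memory grows with the number of frames. In the second loop, detaching the hidden state (the assumption in the question) keeps memory bounded, but then gradients no longer flow back through earlier frames, which changes what the model learns.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model's recurrent memory update.
cell = nn.GRUCell(input_size=8, hidden_size=8)
frames = [torch.randn(1, 8) for _ in range(6)]

# Full BPTT: the autograd graph spans all 6 frames, so activation
# memory grows linearly with sequence length.
h = torch.zeros(1, 8)
for x in frames:
    h = cell(x, h)
h.sum().backward()
full_grad = cell.weight_hh.grad.clone()  # gradient flows through every frame

# Truncated variant: detach the hidden state before each step, so earlier
# activations can be freed. Memory stays roughly constant, but backward()
# from the final state now only covers the last frame's computation.
cell.zero_grad()
h = torch.zeros(1, 8)
for x in frames:
    h = cell(x, h.detach())
h.sum().backward()
truncated_grad = cell.weight_hh.grad.clone()
```

So "fixed" hidden states at inference time do not carry over to training: with full backpropagation the graph (and memory) grows with the sequence, and cutting the graph to save memory also cuts the gradient signal.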