Tensor sizes input in ConvNet and RNN

Thanks for the code. I have a couple of questions regarding tensor sizes.

1) The dataloader creates tensors of size X = (#videos, #frames, 3, H, W) and y = (#videos, 1). There's a loop in the train method over #videos, but in my implementation it only returned index 0, so the input to the ConvNet has size (#videos, #frames, 3, H, W). Is this correct?

2) In the ConvNet's forward method there's a loop over the #frames in the video. It flattens the pooling layer output into a vector, giving a tensor of size (#videos, #frames, CNN_embed_dim), which is both the output of the ConvNet and the input to the RNN. Is this right?
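
To make the shapes in question 2 concrete, here is a minimal sketch of how I understand the frame loop. `ToyEncoder` is a hypothetical stand-in, not the repository's ConvNet, and all sizes (4 videos, 5 frames, 32x32 images, embedding size 16) are made up for illustration:

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the ConvNet; not the repository's model."""

    def __init__(self, cnn_embed_dim=16):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)      # -> (#videos, 8, 1, 1)
        self.fc = nn.Linear(8, cnn_embed_dim)

    def forward(self, x_3d):
        # x_3d: (#videos, #frames, 3, H, W)
        embeddings = []
        for t in range(x_3d.size(1)):            # loop over frames only
            # Frame t of every video at once: (#videos, 3, H, W)
            x = self.pool(self.conv(x_3d[:, t]))
            x = x.view(x.size(0), -1)            # flatten pool output to a vector
            embeddings.append(self.fc(x))        # (#videos, cnn_embed_dim)
        # Stack along a new time axis: (#videos, #frames, cnn_embed_dim)
        return torch.stack(embeddings, dim=1)


X = torch.randn(4, 5, 3, 32, 32)                 # (#videos, #frames, 3, H, W)
out = ToyEncoder()(X)
print(out.shape)                                 # torch.Size([4, 5, 16])
```

In this sketch the only explicit loop is over frames; the #videos dimension is carried through each layer as the ordinary batch dimension.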

I don't quite understand how the RNN processes the batch dimension, i.e. the number of videos. Is there an internal loop over the videos that I can't find in the code?
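
My guess is that the batch is handled inside the recurrent layer itself rather than by a loop in the model code. Below is a small check of that guess, assuming the RNN is an `nn.LSTM` created with `batch_first=True` (I have not confirmed that this matches the repository); the sizes are again made up:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed setup: an LSTM that takes input shaped (#videos, #frames, features).
rnn = nn.LSTM(input_size=16, hidden_size=8, num_layers=1, batch_first=True)
cnn_out = torch.randn(4, 5, 16)                  # (#videos, #frames, CNN_embed_dim)

with torch.no_grad():
    # One call handles all videos; no explicit loop over the batch.
    batched_out, _ = rnn(cnn_out)                # (#videos, #frames, hidden_size)

    # Same computation with a manual loop, one video at a time.
    looped_out = torch.cat(
        [rnn(cnn_out[i:i + 1])[0] for i in range(cnn_out.size(0))], dim=0
    )

print(batched_out.shape)                         # torch.Size([4, 5, 8])
# The two results should agree up to floating-point tolerance.
print(torch.allclose(batched_out, looped_out, atol=1e-5))
```

If that assumption holds, each video in the batch is processed independently with the same weights, and the layer only iterates over the #frames dimension internally.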

HHTseng / video-classification
