Thanks for the code. I have a couple of questions regarding tensor sizes.
1) The dataloader creates tensors of size X = (#videos, #frames, 3, H, W) and y = (#videos, 1). There's a loop over #videos in the train method, but in my implementation it only returned index=0, so the input to the ConvNet has size (#videos, #frames, 3, H, W). Is this correct?
2) In the ConvNet's forward method there's a loop over the #frames in the video. It flattens the pooling layer's output into a vector to get a tensor of size (#videos, #frames, CNN_embed_dim), which is both the output of the ConvNet and the input to the RNN. Is this right?
I don't quite understand how the RNN processes the batch dimension, i.e. the number of videos. Is there some internal loop for this that I can't find in the code?
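
To make the shapes concrete, here is a minimal sketch of my understanding. It does not use the repo's code: TinyConvNet, the sizes, and the choice of nn.LSTM with batch_first=True are all my own assumptions, picked only to keep the example small. As far as I can tell, nn.LSTM accepts the whole batch in a single call, so this sketch has no explicit Python loop over videos, but I'm not sure that's how the repo does it.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only to keep the example small.
n_videos, n_frames, H, W = 2, 5, 32, 32
cnn_embed_dim, rnn_hidden = 16, 8


class TinyConvNet(nn.Module):
    """Stand-in encoder (not the repo's ConvNet): conv -> pool -> linear."""

    def __init__(self, embed_dim):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(4, embed_dim)

    def forward(self, x):  # x: (n_videos, n_frames, 3, H, W)
        frames = []
        for t in range(x.size(1)):                # loop over frames only
            feat = self.pool(self.conv(x[:, t]))  # (n_videos, 4, 1, 1)
            feat = feat.view(feat.size(0), -1)    # flatten to (n_videos, 4)
            frames.append(self.fc(feat))          # (n_videos, embed_dim)
        return torch.stack(frames, dim=1)         # (n_videos, n_frames, embed_dim)


X = torch.randn(n_videos, n_frames, 3, H, W)
cnn = TinyConvNet(cnn_embed_dim)
rnn = nn.LSTM(input_size=cnn_embed_dim, hidden_size=rnn_hidden, batch_first=True)

emb = cnn(X)                # (n_videos, n_frames, cnn_embed_dim)
out, (h_n, c_n) = rnn(emb)  # one call for all videos

print(emb.shape, out.shape, h_n.shape)
```

If this matches what the repo does, then the first dimension stays #videos from the dataloader all the way through the RNN.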