HHTseng / video-classification

Tutorial for video classification / action recognition using 3D CNN / CNN+RNN on UCF101

Tensor sizes input in ConvNet and RNN #12

Open AlexTS1980 opened 5 years ago

AlexTS1980 commented 5 years ago

Thanks for the code. I have a couple of questions regarding tensor sizes.

1) The dataloader creates tensors of size X = (#videos, #frames, 3, H, W) and y = (#videos, 1). The train method contains a loop over #videos, but in my implementation it only returned index=0, so the input to the ConvNet has size (#videos, #frames, 3, H, W). Is this correct?

2) The ConvNet's forward method loops over the #frames in each video and flattens the pooling-layer output into a vector, giving a tensor of size (#videos, #frames, CNN_embed_dim), which is both the output of the ConvNet and the input to the RNN. Is this right?
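
To make the shapes in question 2 concrete, here is a minimal sketch of how I read the per-frame loop. The small CNN is a stand-in, not the repository's actual encoder, and all sizes (B, T, H, W, CNN_embed_dim) are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
B, T, H, W = 2, 4, 32, 32      # videos, frames per video, height, width
CNN_embed_dim = 16

# Stand-in 2D CNN (an assumption; the repository's encoder is larger)
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),       # (B, 8, 1, 1)
    nn.Flatten(),                  # (B, 8): pooled map flattened to a vector
    nn.Linear(8, CNN_embed_dim),   # (B, CNN_embed_dim)
)

x = torch.randn(B, T, 3, H, W)     # one batch as produced by the dataloader

frame_embeddings = []
for t in range(T):
    # x[:, t] selects frame t of every video at once: (B, 3, H, W)
    frame_embeddings.append(cnn(x[:, t]))

# Stack along a new time axis: (B, T, CNN_embed_dim)
cnn_out = torch.stack(frame_embeddings, dim=1)
print(cnn_out.shape)  # torch.Size([2, 4, 16])
```

In this sketch the loop runs over frames only; every iteration already handles all videos in the batch together.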

I don't quite understand how the RNN processes the batch dimension, i.e. the number of videos. Is there some internal loop for this that I can't find in the code?
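
For reference, a small check outside the repository suggests that `nn.LSTM` with `batch_first=True` treats dimension 0 as the batch and processes each video's sequence independently, so no explicit Python loop over videos would be needed. This is a sketch with made-up sizes; that the repository's RNN is configured the same way is an assumption on my part:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
B, T, CNN_embed_dim, hidden = 3, 5, 16, 8

# batch_first=True means the input is laid out as (batch, time, features)
lstm = nn.LSTM(input_size=CNN_embed_dim, hidden_size=hidden, batch_first=True)
lstm.eval()

seq = torch.randn(B, T, CNN_embed_dim)   # stand-in for the ConvNet output

with torch.no_grad():
    # All videos in one call: (B, T, hidden)
    out_batched, _ = lstm(seq)

    # Same videos fed one at a time, then concatenated along the batch axis
    out_looped = torch.cat(
        [lstm(seq[i:i + 1])[0] for i in range(B)], dim=0
    )

print(out_batched.shape)  # torch.Size([3, 5, 8])
# The two results should agree up to floating-point tolerance
print(torch.allclose(out_batched, out_looped, atol=1e-5))
```

If the two outputs match, the batched call is equivalent to looping over videos by hand, with the per-video work handled inside the LSTM call rather than in user code.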