HHTseng / video-classification

Tutorial for video classification/ action recognition using 3D CNN/ CNN+RNN on UCF101

Problem with ResnetCRNN_varlen #39

Open glmanhtu opened 3 years ago

glmanhtu commented 3 years ago

Hello @HHTseng,

I found your code really helpful for understanding both the 3D CNN and the CNN + LSTM architectures. However, I think there is a small problem in how variable-length sequences are handled in the LSTM part. Some videos have as few as 28 frames, and they are padded so that every video has 50 frames. But when you decode the LSTM hidden units, you take the last time step: https://github.com/HHTseng/video-classification/blob/master/ResNetCRNN_varylength/functions.py#L276 which will be all zeros for the padded videos.

I think we have to rely on the second output of torch.nn.utils.rnn.pad_packed_sequence (the per-sequence lengths) to decide which time step to decode for classification.
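Here is a minimal sketch of the idea. It is not the repository's code: the tensor sizes, the single-layer unidirectional LSTM, and names such as `last_valid` are assumptions made up for illustration. It uses the lengths returned by `pad_packed_sequence` to gather the output at each sequence's last real frame instead of at index -1.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this illustration
batch, max_len, feat_dim, hidden = 3, 50, 8, 16
x = torch.randn(batch, max_len, feat_dim)   # stand-in for padded CNN features
lengths = torch.tensor([50, 28, 35])        # true frame counts (CPU tensor)

lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

# Pack so the LSTM skips the padded time steps
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Second return value holds the length of each sequence in the batch
out, out_lengths = pad_packed_sequence(
    packed_out, batch_first=True, total_length=max_len
)

# out[:, -1, :] is all zeros for sequences shorter than max_len,
# so index each sequence at its own last valid time step instead
idx = (out_lengths - 1).view(-1, 1, 1).expand(-1, 1, hidden).to(out.device)
last_valid = out.gather(1, idx).squeeze(1)  # shape: (batch, hidden)
```

For a single-layer, unidirectional LSTM like the one assumed here, `last_valid` should match `h_n[-1]`, so using the final hidden state directly would be another way to get the same result.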

Please let me know what you think.