artest08 / LateTemporalModeling3DCNN


Paper and code inconsistent? #6

Closed wjtan99 closed 3 years ago

wjtan99 commented 4 years ago

Hi, I have been reading your paper and code over the past few days, and I found that the code and the paper are inconsistent. One big part of the paper is the removal of the Temporal Global Average Pooling in Figure 1. For example, in your rgb_I3D.py code, in the model rgb_I3D64f_bert2, the input to the 3D CNN is batch x 3 x 64 x 224 x 224 and the output is batch x 1024 x 8 x 7 x 7. Then you apply another 3D pooling to get batch x 1024 x 8 x 1 x 1. In my understanding, the temporal pooling is already done inside the 3D CNN. In Figure 1 of your paper, you remove the Temporal Global Average Pooling and the output of the 3D CNN still has f1, f2, ..., fN, but in your code there are no such N frame features. Can you help me understand your code and paper?
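
For reference, here is a minimal sketch of the shapes I am describing, with dummy tensors standing in for the I3D backbone; the (1, 7, 7) average-pooling kernel is my reading of the code, so please correct me if that assumption is wrong:

```python
import torch
import torch.nn as nn

# Dummy tensors standing in for the shapes I see in rgb_I3D64f_bert2.
clip = torch.randn(1, 3, 64, 224, 224)       # input: batch x 3 x 64 x 224 x 224
features = torch.randn(1, 1024, 8, 7, 7)     # I3D output: batch x 1024 x 8 x 7 x 7

# The extra 3D pooling I am referring to: it removes the spatial 7 x 7
# but keeps the temporal dimension of 8.
spatial_pool = nn.AvgPool3d(kernel_size=(1, 7, 7))
print(spatial_pool(features).shape)          # torch.Size([1, 1024, 8, 1, 1])
```
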
Thanks a lot.

AfrinaVT commented 3 years ago

My understanding is that they focused on the average pooling along the temporal dimension, not on the number of frames. In the BERT variants, the last average pooling layer uses a kernel size of (1, 7, 7), so no average pooling is applied along the temporal dimension. Also, the temporal dimension does not have to be equal to the number of frames. That's my understanding.
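
A rough sketch of that point with dummy shapes (illustrative only, not the repo's actual modules): a (1, 7, 7) kernel keeps the temporal dimension, whereas a full temporal global average pool would collapse it.

```python
import torch
import torch.nn as nn

features = torch.randn(1, 1024, 8, 7, 7)  # backbone output: the temporal dim is 8, not 64

# Spatial-only pooling (kernel (1, 7, 7)): the temporal dimension of 8 survives.
print(nn.AvgPool3d(kernel_size=(1, 7, 7))(features).shape)  # torch.Size([1, 1024, 8, 1, 1])

# Temporal global average pooling (kernel (8, 7, 7)): the temporal dimension collapses.
print(nn.AvgPool3d(kernel_size=(8, 7, 7))(features).shape)  # torch.Size([1, 1024, 1, 1, 1])
```
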

artest08 commented 3 years ago

Hello, first of all, thanks for the interest and the valuable comments.

For the motivation: in all of the 3D CNN architectures, a temporal global average pooling (TGAP) layer is utilized. The important fact behind the motivation of this study is that the features before the TGAP layer of 3D CNNs have different temporal characteristics: although the receptive field of the different temporal features might cover the whole clip, the effective receptive field has a Gaussian distribution, which means that the effect of each frame on the different temporal features is not equal.

For this reason, we argue that TGAP decreases the richness of the temporal information by weighting the different temporal features equally. This is problematic because one of the temporal features might be more important than the others for recognizing an action. Additionally, TGAP destroys the temporal order information, which can be important for recognizing some actions.

To give an example, at the end of the ResNeXt architecture, after spatial average pooling, the final dimension is 4 x 2048, where 4 denotes the temporal dimension. Temporal global average pooling directly reduces this to 1 x 2048 by weighting the four temporal features equally, whereas BERT applies an attention mechanism in order to select the important temporal parts. With multi-head attention, it is even possible to apply different temporal weights to different parts of the feature dimension.
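
To make the equal-weighting versus attention point concrete, here is a rough single-head sketch; the repo actually uses a BERT module, so the scoring layer below is only an illustrative stand-in:

```python
import torch
import torch.nn as nn

# ResNeXt features after spatial average pooling: (batch, temporal, channels) = (1, 4, 2048)
feats = torch.randn(1, 4, 2048)

# Temporal global average pooling: every temporal position gets the same weight (1/4).
tgap = feats.mean(dim=1)                              # (1, 2048)

# Attention-style weighting: learned, data-dependent weights per temporal position,
# so important parts of the clip can contribute more than others.
score_layer = nn.Linear(2048, 1)                      # illustrative single-head scorer
weights = torch.softmax(score_layer(feats), dim=1)    # (1, 4, 1), sums to 1 over time
attended = (weights * feats).sum(dim=1)               # (1, 2048)

print(tgap.shape, attended.shape)                     # torch.Size([1, 2048]) torch.Size([1, 2048])
```

With multiple heads, each head can produce its own temporal weights for its own slice of the 2048-dimensional features, which is the behaviour described above.
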

wjtan99 commented 3 years ago

@AfrinaVT Thank you for pointing out this key difference between the temporal dimension and the number of frames. In my previous understanding, N was the number of frames, which was incorrect. The paper does say that K frames from the input sequence are propagated through a 3D CNN.

@artest08 Thanks for your explanation; I misunderstood the temporal dimension. If you could add one sentence to the paper explaining the N in Figure 1, that would be great.