SRA2 / SPELL

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection (ECCV 2022)
MIT License

Question about 0.7 GFLOPs for visual feature encoding reported in paper #5

Open SJTUwxz opened 1 year ago

SJTUwxz commented 1 year ago

Thank you for sharing this work! I have a question about how the 0.7 GFLOPs reported for visual feature encoding in the paper is computed. I used the 2D ResNet-18+TSM shared in models_stage1_tsm.py and fed it an input of shape (11, 3, 144, 144), i.e., a stack of 11 consecutive face crops at 144x144 resolution, and I got 8.73 GFLOPs. I used this tool to compute the GFLOPs: https://github.com/facebookresearch/fvcore/blob/main/docs/flop_count.md
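
For reference, this is roughly how I ran the count (torchvision's resnet18 stands in here for the actual TSM model in models_stage1_tsm.py, so the exact number may differ from mine):

```python
import torch
from torchvision.models import resnet18
from fvcore.nn import FlopCountAnalysis

model = resnet18()  # placeholder backbone; the real model adds temporal shift modules
dummy = torch.randn(11, 3, 144, 144)  # 11 consecutive face crops at 144x144

flops = FlopCountAnalysis(model, dummy)
print(flops.total() / 1e9)  # total GFLOPs for the whole 11-crop stack
```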

kylemin commented 1 year ago

Hi,

Thank you for your interest in our work!

We referred to one of the previous works (ASDNet). Specifically, Table 3 of that paper compares the 2D and 3D ResNet series as visual backbones. Those numbers are calculated for a resolution of 160x160, so 2D-ResNet-18 would come to about 0.7 GFLOPs at 144x144. According to the TSM paper, 2D-ResNet-18-TSM should theoretically have the same computational cost; I suspect the temporal shift operations can increase the measured computation when their implementation is not optimized. Did you also compute the GFLOPs of ASDNet's 3D-ResNeXt-101?
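
To make the scaling concrete, here is a rough back-of-the-envelope check (taking the commonly cited ~1.8 GFLOPs for 2D-ResNet-18 at 224x224 as a reference point, and assuming convolutional FLOPs scale with input area):

```python
# Convolutional FLOPs grow roughly in proportion to the spatial resolution.
gflops_224 = 1.8                            # commonly cited figure for ResNet-18 at 224x224
gflops_144 = gflops_224 * (144 / 224) ** 2  # scale down to 144x144
print(round(gflops_144, 2))                 # ~0.74 GFLOPs per crop, close to the reported 0.7
```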

Thank you, Kyle

SJTUwxz commented 1 year ago

Hi Kyle,

Thank you for the quick response!

I've computed the GFLOPs of ASDNet's 3D-ResNeXt-101 given an input of shape (8, 3, 160, 160), i.e., a stack of 8 face crops at 160x160 resolution, and got 13.58 GFLOPs, which is close to their reported number.

For SPELL, is the input a stack of 11 consecutive face crops at 144x144 resolution? And after this input is fed to the 2D-ResNet-18-TSM encoder, does it extract a feature for each of the 11 face crops, with the 11 output features then averaged to get the final feature of the center face?
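
In code terms, I mean something like the sketch below (resnet18 is only a placeholder for the actual model in models_stage1_tsm.py; the averaging step at the end is what I'm asking about):

```python
import torch
from torchvision.models import resnet18

backbone = resnet18()
backbone.fc = torch.nn.Identity()      # keep the 512-d pooled feature per crop
backbone.eval()

crops = torch.randn(11, 3, 144, 144)   # 11 consecutive face crops of one track
with torch.no_grad():
    per_crop = backbone(crops)         # (11, 512): one feature per crop
center_feature = per_crop.mean(dim=0)  # averaged to represent the center face
```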

Also, is models_stage1_tsm.py the feature extraction model you use for SPELL?

Thank you so much!

kylemin commented 1 year ago

Hi again,

I see. Thank you for the information. Yes exactly! And yes, we used models_stage1_tsm.py.

Thank you, Kyle

SJTUwxz commented 1 year ago

Thank you, and sorry for the late reply!

The 0.7 GFLOPs reported on page 3 of the paper is for a single face crop of shape (1, 3, 144, 144) fed into the 2D ResNet-18-TSM, not a stack of 11 face crops, right?
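
As a sanity check on my side, dividing the number I measured earlier by the 11 stacked crops gives roughly that per-crop figure (assuming the cost grows linearly with the number of crops):

```python
# 8.73 GFLOPs was measured for the full (11, 3, 144, 144) stack
per_crop_gflops = 8.73 / 11
print(round(per_crop_gflops, 2))  # ~0.79 GFLOPs per crop, in the ballpark of the reported 0.7
```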

Thanks again!