chaoyuaw / pytorch-coviar

Compressed Video Action Recognition
https://www.cs.utexas.edu/~cywu/projects/coviar/
GNU Lesser General Public License v2.1

Testing with 25 segments is not like the paper #59

Closed Esaada closed 5 years ago

Esaada commented 5 years ago

Hi, in the paper you said that you sample 25 frames uniformly. From what I understood, if the number of frames is, say, 100, you sample 25 indices, and those are the frames; with this method I understand why you get 4.2 GFLOPs. But in the code it doesn't look like that: in https://github.com/chaoyuaw/pytorch-coviar/blob/master/dataset.py#L130 you loop over the number of segments and call coviar for each representation, i.e. you call this module for I-frames, for residuals, and for motion vectors. So do you decide ahead of time how much to sample from each type? A minimal sketch of what I mean by uniform sampling follows below. I'm a bit confused about this testing method and the code, thanks for your help.
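For reference, this is roughly what I understood by "sampling 25 indices uniformly" (a simplified sketch; the function name and details are illustrative, not the exact helpers in dataset.py):

```python
# Sketch of uniform test-time index sampling: take one frame from the
# middle of each of 25 equal-length chunks of the video.
def uniform_indices(num_frames, num_segments=25):
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]

print(uniform_indices(100))  # 25 indices spread over a 100-frame video
```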

chaoyuaw commented 5 years ago

Hi Barak,

Sorry for the confusion.

I think part of the confusion comes from the evaluation protocol we follow. The commonly used protocol samples a fixed number of (e.g. 25) "locations" in a test video. However, in practice, the actual number of frames used by different methods might not be strictly the same. For example, TSN samples 25 RGB frames + 25 optical flow stacks, so the frames actually used total more than 250. For 3D CNNs, e.g. non-local nets, the final prediction is calculated by averaging over "clips", each of which already contains 32 frames. But overall, all these methods sample a fixed number of "locations" in a test video. Our method does the same: we sample 25 locations for each modality and combine the scores, similar to what two-stream networks do. Does that make sense?
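Roughly, the protocol described above looks something like this (a simplified sketch, not the exact code in this repo; the function and dictionary names are just for illustration):

```python
import torch

# For each modality (e.g. 'iframe', 'mv', 'residual'), score the 25 sampled
# locations, average them, then fuse the per-modality scores two-stream style.
def fuse_video_scores(models, inputs, weights):
    # models:  dict modality -> network
    # inputs:  dict modality -> tensor of shape (25, C, H, W)
    # weights: dict modality -> fusion weight
    fused = 0.0
    for name, net in models.items():
        with torch.no_grad():
            scores = net(inputs[name])              # (25, num_classes)
        fused = fused + weights[name] * scores.mean(dim=0)  # average over locations
    return fused.argmax().item()                    # predicted class for the video
```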

Esaada commented 5 years ago

Got it, Thanks.

Esaada commented 5 years ago

I'm sorry, I have one more small question. I just read DMC-Net, which cites your paper a lot, and I saw this line in section 4.2: "25 frames are uniformly sampled for each video; each sampled frame has 5 crops augmented with flipping; all 250 (25×2×5) score predictions are averaged to obtain one video-level prediction." As I understood from you and from the code, you are using 75 frames in total, 25 of each kind: 25 fed to resnet152 and 50 to resnet18. Now I notice that you also feed the data to the network with augmentation at inference, and I can't understand why it is 25×2×5 and not 25×3×5. In the final count, how many times do resnet18 and resnet152 run in order to predict one video, i.e. to get the accuracy you reported in the paper? Of course, I'm all confused.
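For concreteness, here is the back-of-the-envelope count implied by the DMC-Net quote combined with the reading in this question (this is only the questioner's interpretation, not a confirmed answer from the authors):

```python
# 25 locations per modality, 5 crops, flipping doubles each crop (x2).
locations, crops, flips = 25, 5, 2
per_modality = locations * crops * flips   # 250 forward passes per modality
resnet152_passes = per_modality            # I-frame stream
resnet18_passes = 2 * per_modality         # motion-vector + residual streams
print(per_modality, resnet152_passes, resnet18_passes)  # 250 250 500
```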