facebookresearch / AVT

Code release for ICCV 2021 paper "Anticipative Video Transformer"
Apache License 2.0

Question about Object/Image Features #9

Closed okay-okay closed 2 years ago

okay-okay commented 3 years ago

Hi, I was wondering how exactly the object features are used in the model. At each timestep, does the model consider image features and object features concatenated together? Also, do you know how to extract these object features from the raw frames (e.g., would it work on lower-resolution images)? Thanks!

rohitgirdhar commented 3 years ago

Hi, the object features are used independently of the image features. I did not extract them from images myself; I just used the features provided here as the input features. Each one is a 352-dim vector that represents which objects were detected by an object detector. You can refer to the RULSTM paper for more details on how those features were extracted.
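
To make the "352-dim vector of detected objects" idea concrete, here is a rough sketch of how per-frame detections could be turned into such a vector. This is only an illustration under stated assumptions: the `bag_of_objects` helper, the `(class_index, confidence)` input format, and the choice to sum confidences per class are all hypothetical and not taken from this repo. The RULSTM paper describes the procedure that was actually used.

```python
from typing import Iterable, List, Tuple

NUM_OBJECT_CLASSES = 352  # dimensionality mentioned in this thread


def bag_of_objects(
    detections: Iterable[Tuple[int, float]],
    num_classes: int = NUM_OBJECT_CLASSES,
) -> List[float]:
    """Accumulate detection confidences into one fixed-length vector.

    `detections` holds (class_index, confidence) pairs for a single frame.
    Summing confidences per class is an assumption made for illustration;
    see the RULSTM paper for the actual extraction procedure.
    """
    features = [0.0] * num_classes
    for class_index, confidence in detections:
        if not 0 <= class_index < num_classes:
            raise ValueError(f"class index {class_index} out of range")
        features[class_index] += confidence
    return features


# Hypothetical detector output for one frame: class 3 seen twice, class 120 once.
frame_detections = [(3, 0.5), (3, 0.25), (120, 0.75)]
vector = bag_of_objects(frame_detections)
print(len(vector), vector[3], vector[120])
```

Note that box positions are discarded in this sketch, so the vector only records which object classes appear and how confidently.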

okay-okay commented 3 years ago

Thanks! So do you train a separate AVT head on just the object features and then late-fuse the predictions from the different heads during inference, as in RULSTM?

rohitgirdhar commented 3 years ago

Yes, we train a separate model for object features and late fuse with other modalities. RULSTM actually uses a modality attention network (MATT) to combine the predictions; we simply do a weighted average of the predictions.
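
For readers who want to see what a weighted-average late fusion looks like, here is a minimal sketch in plain Python. The `late_fuse` function, the modality names (`rgb`, `obj`), and the weights are hypothetical placeholders chosen for illustration; they are not the values or the code used in this repo.

```python
from typing import Dict, List


def late_fuse(
    predictions: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-modality class scores for one sample."""
    if not predictions:
        raise ValueError("need at least one modality")
    total_weight = sum(weights[name] for name in predictions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    num_classes = len(next(iter(predictions.values())))
    fused = [0.0] * num_classes
    for name, scores in predictions.items():
        if len(scores) != num_classes:
            raise ValueError(f"modality {name!r} has a different number of classes")
        normalized_weight = weights[name] / total_weight
        for class_index, score in enumerate(scores):
            fused[class_index] += normalized_weight * score
    return fused


# Hypothetical scores from two separately trained models (3 classes each).
preds = {"rgb": [0.5, 0.25, 0.25], "obj": [0.25, 0.5, 0.25]}
# Hypothetical weights; in practice these would be tuned on a validation set.
fusion_weights = {"rgb": 3.0, "obj": 1.0}

fused = late_fuse(preds, fusion_weights)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

Unlike a learned attention module such as MATT, the weights here are fixed per modality and do not depend on the input.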