Closed · okay-okay · closed 2 years ago
Hi, the object features are used independently from the image features. They are not extracted from images; I basically just used the features provided here as the input features. They are a 352-dim vector representing which objects were detected by an object detector. You can refer to the RULSTM paper for more details on how those features were extracted.
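In case it helps to see the idea concretely, here is a minimal sketch of how such a fixed-size object feature vector can be built from detector output. This is an illustration under assumptions, not the actual extraction script: I'm assuming 352 object classes and that each class keeps the highest detection confidence in the frame (the function name and input format are hypothetical).

```python
import numpy as np

NUM_CLASSES = 352  # assumed size of the object vocabulary

def object_feature(detections, num_classes=NUM_CLASSES):
    """Build a fixed-size object feature vector for one frame.

    detections: list of (class_id, confidence) pairs from an object
    detector (hypothetical format). Returns a num_classes-dim vector
    holding, per class, the highest confidence among its detections
    (0 if the class was not detected).
    """
    feat = np.zeros(num_classes, dtype=np.float32)
    for class_id, conf in detections:
        feat[class_id] = max(feat[class_id], conf)
    return feat

# Example: classes 3 and 10 detected in a frame (class 3 twice)
f = object_feature([(3, 0.9), (10, 0.4), (3, 0.7)])
print(f.shape)  # (352,)
```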
Thanks! So do you train a separate AVT head on just the object features and then late-fuse the predictions from the different heads during inference, as in RULSTM?
Yes, we train a separate model on the object features and late-fuse its predictions with those of the other modalities. RULSTM uses a modality attention network (MATT) to combine the predictions; we simply take a weighted average of the predictions.
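A weighted average over per-modality predictions can be sketched as follows. This is just an illustration of the fusion rule described above; the modality names and weight values are made up for the example, and the real weights would be tuned on validation data.

```python
import numpy as np

def late_fuse(preds_per_modality, weights):
    """Late-fuse per-modality class scores with a weighted average.

    preds_per_modality: dict modality -> (num_classes,) score array
    weights: dict modality -> scalar weight (need not sum to 1;
    they are normalized here).
    """
    total = sum(weights.values())
    fused = sum(weights[m] * np.asarray(p, dtype=np.float64)
                for m, p in preds_per_modality.items())
    return fused / total

# Hypothetical 3-class scores from two modalities
preds = {
    "rgb": np.array([0.2, 0.5, 0.3]),
    "obj": np.array([0.1, 0.7, 0.2]),
}
fused = late_fuse(preds, {"rgb": 0.6, "obj": 0.4})
# fused = 0.6 * rgb + 0.4 * obj, elementwise
```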
Hi, I was just wondering how exactly the object features are used in the model. At each timestep, does the model take the image features and object features concatenated? Also, do you know how to extract these object features from raw frames (e.g., will it work on lower-resolution images)? Thanks!