facebookresearch / Listen-to-Look

Listen to Look: Action Recognition by Previewing Audio (CVPR 2020)
Creative Commons Attribution 4.0 International

Both streams generate an output of 256 dimensions and thus the concatenated representations yield an image-audio embedding of 512 dimensions. #8

Open alice-cool opened 3 years ago

alice-cool commented 3 years ago

In `/models/imageAudio_model.py`:

```python
class ImageAudioModel(torch.nn.Module):
    def name(self):
        return 'ImageAudioModel'

    def __init__(self):
        super(ImageAudioModel, self).__init__()
        # initialize model
        self.imageAudio_fc1 = torch.nn.Linear(512 * 2, 512 * 2)
        self.imageAudio_fc1.apply(networks.weights_init)
        self.imageAudio_fc2 = torch.nn.Linear(512 * 2, 512)
        self.imageAudio_fc2.apply(networks.weights_init)
```

The paper says both streams output 256-dimensional vectors that are concatenated into a 512-dimensional embedding, but here the code uses `512 * 2`. Does this mean each stream actually feeds in a 512-dimensional vector?
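For reference, a minimal sketch of what the layer shapes in the snippet above imply (the batch size and feature tensors below are illustrative, not from the repo): `fc1` takes `512 * 2 = 1024` inputs, so the concatenated image-audio vector would have to be 1024-dimensional, i.e. 512 per stream rather than 256.

```python
import torch

# Same layer shapes as in imageAudio_model.py: fc1 expects a
# 1024-dim input, so the concatenation must be 512 + 512.
fc1 = torch.nn.Linear(512 * 2, 512 * 2)
fc2 = torch.nn.Linear(512 * 2, 512)

# Hypothetical per-stream features (batch of 4)
image_feat = torch.randn(4, 512)
audio_feat = torch.randn(4, 512)

fused = torch.cat([image_feat, audio_feat], dim=1)  # shape (4, 1024)
out = fc2(torch.relu(fc1(fused)))
print(out.shape)  # torch.Size([4, 512])
```

With 256-dimensional streams as described in the paper, the concatenation would be 512-dimensional and `fc1` would instead be `torch.nn.Linear(512, ...)`.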