facebookresearch / Listen-to-Look

Listen to Look: Action Recognition by Previewing Audio (CVPR 2020)
Creative Commons Attribution 4.0 International

Both streams generate an output of 256 dimensions and thus the concatenated representations yield an image-audio embedding of 512 dimensions. #8

Open alice-cool opened 3 years ago

alice-cool commented 3 years ago

In `/models/imageAudio_model.py`:

```python
class ImageAudioModel(torch.nn.Module):
    def name(self):
        return 'ImageAudioModel'

    def __init__(self):
        super(ImageAudioModel, self).__init__()
        # initialize model
        self.imageAudio_fc1 = torch.nn.Linear(512 * 2, 512 * 2)
        self.imageAudio_fc1.apply(networks.weights_init)
        self.imageAudio_fc2 = torch.nn.Linear(512 * 2, 512)
        self.imageAudio_fc2.apply(networks.weights_init)
```

The paper says both streams output 256-dimensional vectors that are concatenated into a 512-dimensional embedding, but here the code uses `512 * 2`. Does this mean each stream actually feeds in a 512-dimensional vector?
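For reference, a minimal sketch of what the layer shapes in the snippet above imply (the batch size and feature tensors below are illustrative, not from the repo): `fc1` takes `512 * 2 = 1024` inputs, so the concatenated image-audio vector would have to be 1024-dimensional, i.e. 512 per stream rather than 256.

```python
import torch

# Same layer shapes as in imageAudio_model.py: fc1 expects a
# 1024-dim input, so the concatenation must be 512 + 512.
fc1 = torch.nn.Linear(512 * 2, 512 * 2)
fc2 = torch.nn.Linear(512 * 2, 512)

# Hypothetical per-stream features (batch of 4)
image_feat = torch.randn(4, 512)
audio_feat = torch.randn(4, 512)

fused = torch.cat([image_feat, audio_feat], dim=1)  # shape (4, 1024)
out = fc2(torch.relu(fc1(fused)))
print(out.shape)  # torch.Size([4, 512])
```

With 256-dimensional streams as described in the paper, the concatenation would be 512-dimensional and `fc1` would instead be `torch.nn.Linear(512, ...)`.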