Listen to Look: Action Recognition by Previewing Audio (CVPR 2020)
Creative Commons Attribution 4.0 International
Both streams generate an output of 256 dimensions and thus the concatenated representations yield an image-audio embedding of 512 dimensions. #8
Open
alice-cool opened 3 years ago
In /models/imageAudio_model.py:
class ImageAudioModel(torch.nn.Module):
    def name(self):
        return 'ImageAudioModel'
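The paper states that both streams output 256-dimensional vectors, which are concatenated into a 512-dimensional embedding. Here, however, the code uses 512*2. Does that mean each stream actually feeds in a 512-dimensional vector?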
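To make the arithmetic behind the question concrete, here is a minimal sketch in plain Python. It assumes that the `512*2` in the model is the input size of whatever layer consumes the concatenated features; that is an assumption for illustration, not something confirmed from the repository. The helpers `concat_features` and `per_stream_dim` are hypothetical names and do not come from the codebase.

```python
def concat_features(image_feat, audio_feat):
    """Concatenate two per-stream feature vectors (plain lists) end to end."""
    return list(image_feat) + list(audio_feat)


def per_stream_dim(fused_dim, num_streams=2):
    """Infer each stream's output size from the size of the fused input."""
    if fused_dim % num_streams != 0:
        raise ValueError("fused size must split evenly across streams")
    return fused_dim // num_streams


# Paper's description: two 256-d streams give a 512-d embedding.
paper_embedding = concat_features([0.0] * 256, [0.0] * 256)
print(len(paper_embedding))  # 512

# Assumed reading of the code: a fused input of size 512*2.
print(per_stream_dim(512 * 2))  # 512
```

Under that assumption, a fused size of `512*2` would correspond to 512 dimensions per stream, not the 256 described in the paper. With PyTorch tensors the equivalent concatenation would be `torch.cat` along the feature dimension.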