ajinkyaT / Lip_Reading_in_the_Wild_AVSR

Audio-Visual Speech Recognition using Deep Learning

How to fuse audio and visual models to mutual training? #4

Open Hydralisk2333 opened 4 years ago

Hydralisk2333 commented 4 years ago

I have read the paper "Lip Reading Sentences in the Wild". In this paper, at the end of the Spell module, an MLP converts the concatenated attention (context) vectors into a predicted label. The paper mentions that the whole network can be trained with a single modality (audio only or lips only). If I train with lips only, the audio attention vector does not exist. In that case, how does the MLP work? What kind of data should be fed into this layer?
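
To make the question concrete, here is a minimal sketch of the output step I am asking about: an MLP over the concatenated decoder state and per-modality context vectors. The class name, dimensions, and the zero-vector placeholder for the missing modality are my own assumptions for illustration, not something stated in the paper or this repo.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Output MLP over the concatenated decoder state and the video/audio
    attention (context) vectors. Dimensions are illustrative only."""

    def __init__(self, state_dim=512, ctx_dim=512, vocab_size=40):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + 2 * ctx_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vocab_size),
        )

    def forward(self, dec_state, video_ctx, audio_ctx=None):
        # One possible way to handle single-modality training: feed a zero
        # vector in place of the missing modality's context so the MLP input
        # size stays fixed. This is an assumption, not the paper's stated
        # procedure -- is this what the authors intend?
        if audio_ctx is None:
            audio_ctx = torch.zeros_like(video_ctx)
        fused = torch.cat([dec_state, video_ctx, audio_ctx], dim=-1)
        return self.mlp(fused)  # logits over output characters

# Example: a lips-only decoding step (no audio context available)
mlp = FusionMLP()
dec_state = torch.randn(8, 512)   # decoder hidden state, batch of 8
video_ctx = torch.randn(8, 512)   # visual attention context
logits = mlp(dec_state, video_ctx)
```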