I have read the paper "Lip Reading Sentences in the Wild". At the end of the Spell module, an MLP converts the concatenated attention vectors into a predicted character. The paper mentions that the network can be trained on a single modality (audio only or lips only). If I train with lips only, the audio attention vector does not exist. In that case, how does the MLP work? What should be fed into this layer?
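To make the question concrete, here is a minimal PyTorch sketch of my understanding. The dimensions, layer sizes, and the idea of zero-filling the missing audio context vector are all my own assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- not from the paper, chosen only for illustration.
VIDEO_DIM, AUDIO_DIM, HIDDEN, NUM_CHARS = 256, 256, 128, 40

# An MLP mapping the concatenated attention (context) vectors
# to character logits, as I understand the output of Spell.
mlp = nn.Sequential(
    nn.Linear(VIDEO_DIM + AUDIO_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, NUM_CHARS),
)

video_ctx = torch.randn(1, VIDEO_DIM)  # lips attention vector
# Is the missing audio context simply a zero vector, or is the
# input layer resized when only one modality is used?
audio_ctx = torch.zeros(1, AUDIO_DIM)

logits = mlp(torch.cat([video_ctx, audio_ctx], dim=1))
```

Is zero-filling the absent modality's context vector (as above) the intended behaviour, or does the single-modality network use an MLP with a smaller input dimension?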