I have read the paper "Lip Reading Sentences in the Wild". At the end of the Spell module, an MLP converts the concatenated attention vectors into a predicted character. The paper mentions that the network can be trained on a single modality (audio only or lips only). If I train with lips only, the audio attention vector does not exist. In that case, how does the MLP work? What should be fed into this layer?
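To make the question concrete, here is a minimal PyTorch sketch of my understanding. The dimensions, layer sizes, and the idea of zero-filling the missing audio context vector are all my own assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- not from the paper, chosen only for illustration.
VIDEO_DIM, AUDIO_DIM, HIDDEN, NUM_CHARS = 256, 256, 128, 40

# An MLP mapping the concatenated attention (context) vectors
# to character logits, as I understand the output of Spell.
mlp = nn.Sequential(
    nn.Linear(VIDEO_DIM + AUDIO_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, NUM_CHARS),
)

video_ctx = torch.randn(1, VIDEO_DIM)  # lips attention vector
# Is the missing audio context simply a zero vector, or is the
# input layer resized when only one modality is used?
audio_ctx = torch.zeros(1, AUDIO_DIM)

logits = mlp(torch.cat([video_ctx, audio_ctx], dim=1))
```

Is zero-filling the absent modality's context vector (as above) the intended behaviour, or does the single-modality network use an MLP with a smaller input dimension?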