I checked the file model_video_caption_mplug.py, but I could not find a universal layer module in the code. What I do see is that images are first fed into the visual encoders and then into the text encoders. Does this mean the universal layer is actually the text encoder?