X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl

Question about Abstractor's FFN and Attention #219

Open jp1924 opened 5 months ago

jp1924 commented 5 months ago

@LukeForeverYoung @MAGAer13 First of all, thanks for your great work.

I have a question regarding the Feed Forward Network (FFN) of the Abstractor and the forward method of MplugOwlVisualAbstractorAttention.

From issue #10, I learned that the abstractor uses an FFN that applies Llama's SwiGLU. However, in mPLUG-Owl it uses LayerNorm instead of Llama's RMSNorm. Is there a reason for this change? Was LayerNorm chosen over RMSNorm because the Abstractor is a module for processing images?
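To make the comparison concrete, here is a minimal sketch of the two pre-norm variants I am asking about (module names and dimensions are illustrative, not the actual mPLUG-Owl code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm as used in Llama: rescale by root-mean-square, no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block in the Llama style: silu(gate) * up, then down projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

# The question in a nutshell: the abstractor pairs a SwiGLU FFN with nn.LayerNorm
# (mean-centering + learned bias) rather than RMSNorm as Llama does.
dim, hidden = 1024, 2816
x = torch.randn(2, 65, dim)
out_layernorm = SwiGLUFFN(dim, hidden)(nn.LayerNorm(dim)(x))  # abstractor-style pre-norm
out_rmsnorm = SwiGLUFFN(dim, hidden)(RMSNorm(dim)(x))         # Llama-style pre-norm
```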

Also, as far as I know, MplugOwlVisualAbstractorAttention is designed based on the Q-Former from BLIP-2.

```python
# HACK we apply norm on q and k
hidden_states = self.norm1(hidden_states)
encoder_hidden_states = self.normk(encoder_hidden_states)
```

However, this snippet in the forward method of MplugOwlVisualAbstractorAttention does not exist in the Q-Former. Was there a problem in the implementation that required this addition?
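For reference, here is a simplified sketch of what I understand the extra step to be (the class and dimensions are illustrative stand-ins, not the real MplugOwlVisualAbstractorAttention): both the query states and the encoder states are layer-normalized before cross-attention, whereas BLIP-2's Q-Former forward feeds them in directly.

```python
import torch
import torch.nn as nn

class CrossAttentionWithQKNorm(nn.Module):
    """Illustrative cross-attention block with the extra pre-normalization
    on the query states and the encoder (key/value) states."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)  # applied to the learnable query tokens
        self.normk = nn.LayerNorm(dim)  # applied to the image (encoder) features
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden_states, encoder_hidden_states):
        # The extra step that BLIP-2's Q-Former forward does not have:
        # normalize both the query side and the key/value side before attention.
        hidden_states = self.norm1(hidden_states)
        encoder_hidden_states = self.normk(encoder_hidden_states)
        out, _ = self.attn(hidden_states, encoder_hidden_states, encoder_hidden_states)
        return out

# Usage: 64 learnable queries attending over 257 ViT patch features (shapes are examples).
queries = torch.randn(2, 64, 1024)
image_feats = torch.randn(2, 257, 1024)
block = CrossAttentionWithQKNorm(dim=1024, num_heads=16)
print(block(queries, image_feats).shape)  # torch.Size([2, 64, 1024])
```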