Open gaoyixuan111 opened 7 months ago
The statements at the end of the `WPlusAttnProcessor` class define the residual connection. Are you treating the initial `hidden_states` (the input from the previous step) as `residual`, and the `hidden_states` after the W+ QKV calculation as the actual functional residual? The key statements, in order, are:

1. `residual = hidden_states`
2. `hidden_states = hidden_states + self.scale * wplus_hidden_states`
3. `hidden_states = hidden_states + residual`
Your work is excellent; thank you sincerely for your response.
As I understand your question: in `hidden_states = hidden_states + self.scale * wplus_hidden_states`, the second term, `self.scale * wplus_hidden_states`, is the residual in our work.
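To make the order of the three quoted statements concrete, here is a minimal sketch in plain Python. It is not the actual `WPlusAttnProcessor` code: the function name `combine` and its arguments are hypothetical, and `attn_out` and `wplus_out` are stand-ins for whatever the regular attention and the W+ attention produce before the final additions.

```python
def combine(block_input, attn_out, wplus_out, scale=1.0):
    """Mirror the order of the three statements quoted in the question.

    All arguments are plain lists of floats standing in for tensors
    (an assumption made for illustration only).
    """
    # 1. The block input is saved before any attention is computed.
    residual = block_input

    # 2. The scaled W+ output is added to the regular attention output.
    #    This added term is what the reply calls "the residual in our work".
    hidden_states = [h + scale * w for h, w in zip(attn_out, wplus_out)]

    # 3. The saved block input is added back: the usual skip connection.
    hidden_states = [h + r for h, r in zip(hidden_states, residual)]
    return hidden_states


result = combine([1.0, 2.0], [0.5, 0.5], [2.0, 4.0], scale=0.5)
print(result)
```

Under this reading there are two additive terms: `scale * wplus_out` carries the W+ contribution, while `residual` is the block's skip connection. Setting `scale=0.0` in the sketch removes the W+ term and leaves only the regular attention output plus the skip connection.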