Closed BlueSkyBubble closed 3 years ago
Hi, in BertSelfAttention, hidden_states is passed through the Q, K, and V matrices to produce mixed_query_layer, mixed_key_layer, and mixed_value_layer respectively. My question is: why do all three of these results have to go through the transpose_for_scores function? In particular, how should I understand the line new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) inside transpose_for_scores?
Or to put it another way: why does new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) make multi-head attention possible?
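For reference, here is a minimal standalone sketch of what that shape arithmetic does to the tensor, using made-up example sizes (batch size, sequence length) rather than any particular BERT config; it is not the library code itself, just the reshape-and-permute pattern the question is asking about:

```python
import torch

# Example sizes, not from a real config (base BERT uses hidden_size=768,
# num_attention_heads=12, attention_head_size=64).
batch_size, seq_len = 2, 5
num_attention_heads, attention_head_size = 12, 64
hidden_size = num_attention_heads * attention_head_size  # 768

# mixed_query_layer (and likewise mixed_key_layer / mixed_value_layer)
# has shape (batch_size, seq_len, hidden_size) after the Q/K/V Linear layers.
x = torch.randn(batch_size, seq_len, hidden_size)

# x.size()[:-1] is (batch_size, seq_len); appending
# (num_attention_heads, attention_head_size) splits the last dimension
# hidden_size into num_attention_heads separate heads of size attention_head_size.
new_x_shape = x.size()[:-1] + (num_attention_heads, attention_head_size)
x = x.view(*new_x_shape)    # (batch, seq_len, num_heads, head_size)
x = x.permute(0, 2, 1, 3)   # (batch, num_heads, seq_len, head_size)

print(x.shape)  # torch.Size([2, 12, 5, 64])
```

After the permute, the head dimension sits next to the batch dimension, so the subsequent matmul between query and key computes attention scores for all heads in parallel, each head only seeing its own attention_head_size slice of the hidden vector.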