Why are the dimensions of key_layer and value_layer expanded?

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

This is code from ChatGLM2-6B. Could someone help explain why the dimensions need to be expanded?

    if self.multi_query_attention:
        key_layer = key_layer.unsqueeze(-2)
        key_layer = key_layer.expand(
            -1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1
        )
        key_layer = key_layer.contiguous().view(
            key_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)
        )
        value_layer = value_layer.unsqueeze(-2)
        value_layer = value_layer.expand(
            -1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1
        )
        value_layer = value_layer.contiguous().view(
            value_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)
        )
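To make the question concrete, here is a minimal sketch of what the quoted lines do to the tensor shape. It assumes the `[seq_len, batch, heads, head_dim]` layout and uses toy sizes; `num_heads`, `num_groups`, and the other local names are hypothetical stand-ins for the `self.*` attributes, not values taken from the model config. Under those assumptions the key tensor starts with fewer heads than the query tensor, and the unsqueeze/expand/view sequence repeats each key group so the head dimensions match, which appears to be the same result as `repeat_interleave` along the head axis.

```python
import torch

# Toy sizes, chosen only for illustration (not the real model config).
seq_len, batch = 3, 1
num_heads = 4    # stand-in for num_attention_heads_per_partition
num_groups = 2   # stand-in for num_multi_query_groups_per_partition
head_dim = 5     # stand-in for hidden_size_per_attention_head

# Assumed layout: [seq_len, batch, heads, head_dim].
query_layer = torch.randn(seq_len, batch, num_heads, head_dim)
key_layer = torch.randn(seq_len, batch, num_groups, head_dim)

# Same three steps as the quoted snippet.
expanded = key_layer.unsqueeze(-2)                                   # [3, 1, 2, 1, 5]
expanded = expanded.expand(-1, -1, -1, num_heads // num_groups, -1)  # [3, 1, 2, 2, 5]
expanded = expanded.contiguous().view(
    expanded.size()[:2] + (num_heads, head_dim)
)                                                                    # [3, 1, 4, 5]

# Each key group is now repeated for every query head that shares it.
reference = key_layer.repeat_interleave(num_heads // num_groups, dim=2)

print(key_layer.shape, "->", expanded.shape, "query:", query_layer.shape)
print("matches repeat_interleave:", torch.equal(expanded, reference))
```

After the expansion, heads 0 and 1 hold the same key values (group 0) and heads 2 and 3 hold group 1, so the key tensor has one entry per query head. Note that `expand` alone only creates a broadcast view; the copy happens at `contiguous()`.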

Expected Behavior

No response

Steps To Reproduce

None

Environment

- OS: Ubuntu 18.04
- Python: 3.8
- Transformers:
- PyTorch: 2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

THUDM / ChatGLM-6B