Use linear layers (`nn.Linear`) instead of `Conv1D`. (CPU, GPU optimization; see the first sketch below.)
Merge the key and value caches, and change how `layer_past`/`presents` are stored. (CPU, GPU; see the merged-cache sketch below.)
Change the memory layout of the MHA key-value pairs from `(2, self.num_heads, self.head_dim)` to `(self.num_heads, 2, self.head_dim)`, as needed to merge the caches efficiently. This differs from transformers GPT2 but matches Megatron-LM. (GPU)
Avoid redundant views from `split_heads` and `merge_heads` with MQA, and move these functions inline. (CPU; see the MQA sketch below.)
Remove `_matmul` and put back specialized versions in `_attn` to save some CPU time. (CPU; see the `baddbmm` sketch below.)
Adapt the conversion script to the new memory layout and trim it down (see the reordering sketch below).
The new converted model is in the `linear` branch of `bigcode/santacoder-fast-inference`.
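A minimal sketch of the `Conv1D`-to-`nn.Linear` replacement, assuming a recent transformers version that exposes `Conv1D` in `transformers.pytorch_utils`; `conv1d_to_linear` is a hypothetical helper, not the actual code. `Conv1D` stores its weight as `(in_features, out_features)`, the transpose of `nn.Linear`'s layout, so the weight must be transposed when converting:

```python
import torch
import torch.nn as nn
from transformers.pytorch_utils import Conv1D

def conv1d_to_linear(conv1d: Conv1D) -> nn.Linear:
    # Conv1D weight is (in_features, out_features); nn.Linear expects
    # (out_features, in_features), so transpose when copying.
    in_features, out_features = conv1d.weight.shape
    linear = nn.Linear(in_features, out_features)
    with torch.no_grad():
        linear.weight.copy_(conv1d.weight.t())
        linear.bias.copy_(conv1d.bias)
    return linear
```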
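A hedged sketch of the merged key-value cache on the MHA path; all shapes and names here are illustrative assumptions, not the actual implementation. With the QKV weights reordered to the Megatron-LM layout, the projection output can be viewed as `(..., num_heads, 3, head_dim)`, so key and value come out as a single `(num_heads, 2, head_dim)` slice, and `layer_past`/`present` become one tensor instead of a `(key, value)` tuple:

```python
import torch

batch, past_len, new_len, num_heads, head_dim = 2, 3, 1, 16, 64

# Fused QKV projection output, assumed already in Megatron-LM ordering
# so the head dimension comes before the (q, k, v) dimension.
qkv = torch.randn(batch, new_len, num_heads * 3 * head_dim)
qkv = qkv.view(batch, new_len, num_heads, 3, head_dim)

query = qkv[..., 0, :]       # (batch, new_len, num_heads, head_dim)
key_value = qkv[..., 1:, :]  # (batch, new_len, num_heads, 2, head_dim): one slice

# layer_past / present are single tensors rather than (key, value) tuples.
layer_past = torch.randn(batch, past_len, num_heads, 2, head_dim)
present = torch.cat((layer_past, key_value), dim=1)

key = present[..., 0, :]     # (batch, past_len + new_len, num_heads, head_dim)
value = present[..., 1, :]
```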
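A sketch of why MQA makes `split_heads`/`merge_heads` nearly free, under illustrative shapes (causal masking omitted for brevity). With a single shared key/value head, only the query carries a head dimension, and it can be folded into the sequence dimension with one view on each side of the two batched matmuls:

```python
import torch

batch, seq_q, seq_k, num_heads, head_dim = 2, 5, 7, 16, 64
query = torch.randn(batch, seq_q, num_heads * head_dim)
key = torch.randn(batch, seq_k, head_dim)    # single shared key/value head
value = torch.randn(batch, seq_k, head_dim)

# "split_heads" degenerates to one view: fold heads into the sequence dim.
q = query.view(batch, seq_q * num_heads, head_dim)

attn_weights = torch.bmm(q, key.transpose(1, 2)) / head_dim ** 0.5
attn_weights = torch.softmax(attn_weights, dim=-1)
attn_output = torch.bmm(attn_weights, value)  # (batch, seq_q * num_heads, head_dim)

# "merge_heads" is likewise a single view back.
attn_output = attn_output.view(batch, seq_q, num_heads * head_dim)
```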
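A sketch of the kind of specialization meant by removing `_matmul`: instead of dispatching through a generic helper, `_attn` can call `torch.baddbmm` directly for the scaled first product (the `1/sqrt(head_dim)` scaling is fused in via `alpha`, and `beta=0` ignores the uninitialized input tensor) and a plain `torch.bmm` for the second, which needs no scaling. Shapes are illustrative:

```python
import torch

batch_heads, seq_q, seq_k, head_dim = 32, 5, 7, 64
query = torch.randn(batch_heads, seq_q, head_dim)
key = torch.randn(batch_heads, seq_k, head_dim)
value = torch.randn(batch_heads, seq_k, head_dim)

# Specialized first matmul: scaling fused via alpha, no generic dispatch.
attn_weights = torch.baddbmm(
    torch.empty(batch_heads, seq_q, seq_k),
    query,
    key.transpose(1, 2),
    beta=0,
    alpha=head_dim ** -0.5,
)
attn_weights = torch.softmax(attn_weights, dim=-1)
# Second matmul needs no scaling, so a plain bmm suffices.
attn_output = torch.bmm(attn_weights, value)
```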
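A hedged sketch of the weight reordering the conversion script needs, with hypothetical names and shapes; the actual script also handles the remaining layers. GPT2 groups the fused QKV output rows as `(3, num_heads, head_dim)`, while the new layout groups them as `(num_heads, 3, head_dim)`:

```python
import torch

num_heads, head_dim = 16, 64
hidden_size = num_heads * head_dim

# Fused QKV weight in nn.Linear layout: (3 * hidden_size, hidden_size).
w = torch.randn(3 * hidden_size, hidden_size)
b = torch.randn(3 * hidden_size)

# Regroup output rows from (3, num_heads, head_dim) to (num_heads, 3, head_dim).
w = w.view(3, num_heads, head_dim, hidden_size).permute(1, 0, 2, 3)
w = w.reshape(3 * hidden_size, hidden_size)
b = b.view(3, num_heads, head_dim).permute(1, 0, 2).reshape(3 * hidden_size)
```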