Closed: ebraraktas closed this 1 month ago
While implementing this, I saw that the rotary embedding layer is shared among the layers of Llama3, and the Hugging Face implementation has been refactored to use that shared instance. Maybe we can implement this feature in CTranslate2 to reduce memory usage.
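To illustrate the idea, here is a minimal PyTorch-style sketch of sharing one rotary embedding module across decoder layers so its cached tables are allocated once instead of once per layer. The class and layer names are hypothetical, not CTranslate2's or HF's actual API.

```python
# Hypothetical sketch only, not CTranslate2's actual layer code.
import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_positions=4096, base=10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        freqs = torch.outer(torch.arange(max_positions).float(), inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # Cached cos/sin tables; allocated a single time when the module is shared.
        self.register_buffer("cos_cached", emb.cos(), persistent=False)
        self.register_buffer("sin_cached", emb.sin(), persistent=False)

class DecoderLayer(torch.nn.Module):
    def __init__(self, rotary_emb):
        super().__init__()
        self.rotary_emb = rotary_emb  # same object stored by every layer

shared_rope = RotaryEmbedding(dim=128)
layers = [DecoderLayer(shared_rope) for _ in range(32)]  # one table, 32 layers
```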
Hello, thank you for your PR. I will fix the CI soon and you can rebase on master. What do you mean by "rotary embedding layer is shared among layers of Llama3 and the Hugging Face implementation is refactored to use that shared one"? Can you add the link here?
Thanks for the comment; I will rebase once you fix it. @minhthuc2502
For RoPE sharing:

- `rotary_emb` will be removed from `LlamaAttention`, see this comment.
- `position_embeddings` (a tuple of `cos` and `sin`) will be generated from the input embeddings at the beginning of inference inside `LlamaModel.forward` and passed to the layers using it.
- `position_embeddings` will be passed as an input to `LlamaAttention.forward`, implemented here.
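To make the pattern concrete, a minimal sketch of the flow described above, assuming PyTorch-style modules; the names only mirror the HF layout and are not the actual transformers code. The point is that the `(cos, sin)` pair is computed once per forward call in the model and threaded through every attention layer, instead of each layer recomputing it.

```python
# Hypothetical sketch, not the actual transformers implementation.
import torch

def rotary_cos_sin(position_ids, dim, base=10000.0):
    # Build the cos/sin tables once for the requested positions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(position_ids.float(), inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()

class AttentionSketch(torch.nn.Module):
    def forward(self, hidden_states, position_embeddings):
        cos, sin = position_embeddings  # received from the model, not recomputed here
        # (apply the rotary embedding to q/k and run attention; omitted in this sketch)
        return hidden_states

class ModelSketch(torch.nn.Module):
    def __init__(self, num_layers, head_dim):
        super().__init__()
        self.head_dim = head_dim
        self.layers = torch.nn.ModuleList([AttentionSketch() for _ in range(num_layers)])

    def forward(self, hidden_states, position_ids):
        # Generated once at the start of inference, then passed to the layers using it.
        position_embeddings = rotary_cos_sin(position_ids, self.head_dim)
        for layer in self.layers:
            hidden_states = layer(hidden_states, position_embeddings)
        return hidden_states
```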
Ah, I understand your point. Actually, we can keep the current architecture, because doing this would require more changes here than it did in HF.
Could you rebase on the master branch, please?
As in this PR, a "no space left" error occurred during the Docker step of the CI/CD pipeline. @minhthuc2502
Fixes #1745.
The implementation is ported from transformers.