Closed: ebraraktas closed this 1 month ago
While implementing this, I saw that the rotary embedding layer is shared among the layers of Llama3, and the Hugging Face implementation has been refactored to use that shared instance. Maybe we can implement this feature in CTranslate2 to reduce memory usage.
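To illustrate the idea, here is a minimal PyTorch-style sketch of sharing one rotary embedding module across decoder layers so its cached tables are allocated once instead of once per layer. The class and layer names are hypothetical, not CTranslate2's or HF's actual API.

```python
# Hypothetical sketch only, not CTranslate2's actual layer code.
import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_positions=4096, base=10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        freqs = torch.outer(torch.arange(max_positions).float(), inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # Cached cos/sin tables; allocated a single time when the module is shared.
        self.register_buffer("cos_cached", emb.cos(), persistent=False)
        self.register_buffer("sin_cached", emb.sin(), persistent=False)

class DecoderLayer(torch.nn.Module):
    def __init__(self, rotary_emb):
        super().__init__()
        self.rotary_emb = rotary_emb  # same object stored by every layer

shared_rope = RotaryEmbedding(dim=128)
layers = [DecoderLayer(shared_rope) for _ in range(32)]  # one table, 32 layers
```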
Hello, thank you for your PR. I will fix the CI soon and you can rebase on master. What do you mean by "rotary embedding layer is shared among layers of Llama3 and the Hugging Face implementation is refactored to use that shared one"? Can you add the link here?
Thanks for the comment; I will rebase once you fix it. @minhthuc2502
For RoPE sharing:

- `rotary_emb` will be removed from `LlamaAttention`, see this comment.
- `position_embeddings` (a tuple of `cos` and `sin`) will be generated from the input embeddings at the beginning of inference inside `LlamaModel.forward` and passed to the layers using it.
- `position_embeddings` will be passed as an input to `LlamaAttention.forward`, implemented here.
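To make the pattern concrete, a minimal sketch of the flow described above, assuming PyTorch-style modules; the names only mirror the HF layout and are not the actual transformers code. The point is that the `(cos, sin)` pair is computed once per forward call in the model and threaded through every attention layer, instead of each layer recomputing it.

```python
# Hypothetical sketch, not the actual transformers implementation.
import torch

def rotary_cos_sin(position_ids, dim, base=10000.0):
    # Build the cos/sin tables once for the requested positions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(position_ids.float(), inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()

class AttentionSketch(torch.nn.Module):
    def forward(self, hidden_states, position_embeddings):
        cos, sin = position_embeddings  # received from the model, not recomputed here
        # (apply the rotary embedding to q/k and run attention; omitted in this sketch)
        return hidden_states

class ModelSketch(torch.nn.Module):
    def __init__(self, num_layers, head_dim):
        super().__init__()
        self.head_dim = head_dim
        self.layers = torch.nn.ModuleList([AttentionSketch() for _ in range(num_layers)])

    def forward(self, hidden_states, position_ids):
        # Generated once at the start of inference, then passed to the layers using it.
        position_embeddings = rotary_cos_sin(position_ids, self.head_dim)
        for layer in self.layers:
            hidden_states = layer(hidden_states, position_embeddings)
        return hidden_states
```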
Ah, I understand your point. Actually, we can keep the current architecture, because doing this would require more changes here than it did in HF.
Could you rebase on the master branch, please?
As in this PR, a "no space left" error occurred during the Docker step of the CI/CD pipeline. @minhthuc2502
Fixes #1745.
The implementation is ported from transformers.