Closed zliendo closed 6 months ago
Oh, one more piece of information: if sequence_length <= 256, then I do get 35 ms/token for both llama2 and codellama (unfortunately, code generation usually needs sequence_length > 256).
Thanks @zliendo . We will take a look.
Thanks @zliendo. I have checked the results in the blog; they are accurate as of 11/7/23, when the blog was published. We are continuing to improve the performance of LLaMA on Neuron, so we hope you will try an upcoming release. Please keep an eye on the what's-new and performance pages for updated performance data.
Thank you so much! I will for sure try upcoming releases.
Hi,
According to this blog https://huggingface.co/blog/inferentia-llama2,
the expected latency for llama-2 inference on an inf2.xlarge is about 60 ms/token, and I do get these results when running llama2 on an inf2.xlarge.
I have also tested codellama (fine-tuned from llama2) on an inf2.xlarge and get about 50 to 60 ms/token. When I run codellama on a g5.xlarge I get 30 ms/token, which is faster than on the inf2.xlarge.
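For what it's worth, here is roughly how I compute ms/token — a minimal sketch where `generate_fn` is a placeholder for the actual model's generate call (not a real Neuron or transformers API), so it only illustrates the timing methodology:

```python
import time

def measure_ms_per_token(generate_fn, prompt_ids, max_new_tokens=64):
    """Time one generation call and return average ms per generated token.

    generate_fn is a stand-in for your model's generate call; it should
    return the full token sequence (prompt tokens + newly generated tokens).
    """
    start = time.perf_counter()
    output_ids = generate_fn(prompt_ids, max_new_tokens)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    new_tokens = len(output_ids) - len(prompt_ids)
    return elapsed_ms / max(new_tokens, 1)

# Demo with a fake "model" that sleeps ~1 ms per generated token:
def fake_generate(prompt_ids, max_new_tokens):
    time.sleep(0.001 * max_new_tokens)
    return prompt_ids + [0] * max_new_tokens

rate = measure_ms_per_token(fake_generate, [1, 2, 3], max_new_tokens=50)
print(f"{rate:.2f} ms/token")
```

In practice I also discard the first call as a warm-up, since compilation/first-run overhead would otherwise inflate the number.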
is this expected?
Thank you