aws-neuron / transformers-neuronx

llama-2/codellama benchmark for inf2.xlarge #64

Closed: zliendo closed this 6 months ago

zliendo commented 6 months ago

Hi,

According to this blog, https://huggingface.co/blog/inferentia-llama2, the expected latency for llama-2 inference on an inf2.xlarge is about 60 ms/token. I do get these results when running llama-2 on an inf2.xlarge.

I have also tested codellama (fine-tuned from llama-2) on an inf2.xlarge, and I'm getting about 50 to 60 ms/token. When I run codellama on a g5.xlarge, I get about 30 ms/token, which is faster than the inf2.xlarge.
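For reference, ms/token here means total generation time divided by the number of newly generated tokens. A minimal sketch of that measurement on a GPU instance such as g5.xlarge, where the checkpoint and generation settings are assumptions rather than the exact benchmark setup:

```python
# Hypothetical ms/token measurement on a GPU (e.g. g5.xlarge).
# The checkpoint and generation settings are assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed 7B variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=512, do_sample=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Per-token decode latency over the generated tokens only
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{1000 * elapsed / new_tokens:.1f} ms/token")
```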

Is this expected?

Thank you

zliendo commented 6 months ago

Oh, one more piece of information: if sequence_length <= 256, then I do get 35 ms/token for both llama-2 and codellama. (Unfortunately, code generation usually needs sequence_length > 256.)
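For context, transformers-neuronx fixes the maximum sequence length at compile time via n_positions, which is why the latency varies with sequence_length. A minimal sketch of the setup, where the checkpoint, tp_degree, and sampling parameters are assumptions, not the exact configuration used here:

```python
# Hypothetical transformers-neuronx setup; tp_degree=2 matches the two
# NeuronCores on an inf2.xlarge. n_positions is the compiled maximum
# sequence length that the latency numbers above depend on.
import time
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

model_dir = "codellama/CodeLlama-7b-hf"  # assumed checkpoint
neuron_model = LlamaForSampling.from_pretrained(
    model_dir,
    batch_size=1,
    tp_degree=2,       # shard across both NeuronCores on inf2.xlarge
    n_positions=2048,  # compiled max sequence length; compare 256 vs 2048
    amp="f16",
)
neuron_model.to_neuron()  # compile and load onto the Neuron device

tokenizer = AutoTokenizer.from_pretrained(model_dir)
input_ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids

start = time.perf_counter()
# sequence_length must be <= the compiled n_positions
generated = neuron_model.sample(input_ids, sequence_length=1024, top_k=50)
elapsed = time.perf_counter() - start

new_tokens = generated.shape[-1] - input_ids.shape[-1]
print(f"{1000 * elapsed / new_tokens:.1f} ms/token")
```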

jeffhataws commented 6 months ago

Thanks @zliendo . We will take a look.

jeffhataws commented 6 months ago

Thanks @zliendo. I have checked the results in the blog, and they reflect performance as of 11/7/23, when the blog was published. We are continuing to improve the performance of LLaMA on Neuron, so we hope you will try an upcoming release. Please keep an eye on the What's New page and the performance page for updated performance data.

zliendo commented 6 months ago

Thank you so much! I will definitely try the upcoming releases.