Closed by sayli-ds 1 month ago
@sayli-ds Perhaps you can try SageMaker. Some references: https://pytorch.org/blog/high-performance-llama/ https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/pytorch_script_mode_local_training_and_serving/pytorch_script_mode_local_training_and_serving.py
I could create an endpoint as above for the Llama 2 13B base model, but for the 13B chat model it gives a timeout error on the container primary.
For the above, I created the Neuron artifacts for the 13B chat model using this guide: https://github.com/pytorch/serve/blob/master/examples/large_models/inferentia2/llama2/Readme.md?plain=1#L56 I could start TorchServe and run inference via the curl command there, so the model artifacts look okay. But the same artifacts don't work with the first notebook reference link.
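On the timeout: for large models, SageMaker's default startup health-check window is often too short for the container to finish loading the checkpoint, so the endpoint fails on the primary container's ping check before the model is ready. A minimal sketch of deploy arguments that extend that window, assuming the SageMaker Python SDK's `container_startup_health_check_timeout` parameter; the instance type and timeout value are placeholder assumptions, not values from this thread:

```python
import os

# Hypothetical deploy kwargs; instance type and timeout are placeholder
# assumptions, not values taken from this thread.
deploy_kwargs = dict(
    initial_instance_count=1,
    instance_type="ml.inf2.8xlarge",              # Inferentia2 instance (assumption)
    container_startup_health_check_timeout=1800,  # seconds SageMaker waits for the
                                                  # primary container to pass its
                                                  # ping health check
)

if os.environ.get("RUN_DEPLOY"):
    # Requires AWS credentials and a `model` object built elsewhere
    # (e.g. sagemaker.pytorch.PyTorchModel); guarded so the sketch
    # stays runnable offline.
    predictor = model.deploy(**deploy_kwargs)  # noqa: F821
else:
    print(deploy_kwargs["container_startup_health_check_timeout"])
```

Whether 1800 seconds is enough depends on model size and instance type; the point is that the default is usually far smaller than a 13B model's load time.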
Hello @sayli-ds,
This appears to be an issue with SageMaker and/or the community example model. I see you've already reached out to the community (https://github.com/aws/amazon-sagemaker-examples-community/issues/9). You may also want to ask the SageMaker community for guidance on debugging models here: https://repost.aws/search/content?globalSearch=sagemaker.
I'm going to resolve this for now since it does not appear to be an Inferentia issue, but feel free to re-open if you need further assistance.
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
What are the best ways to deploy the above model for fast inference from a local machine while also supporting parallel requests?
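On the parallel-requests part of the question: TorchServe batches requests on the server side, so the client only needs to fan requests out concurrently. A minimal sketch using a thread pool, with a dummy `send` function standing in for the real HTTP call (e.g. a `requests.post` to the endpoint's predictions URL, which is an assumption, not code from this thread):

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompts, send, max_workers=8):
    """Send many inference requests concurrently.

    Results come back in the same order as `prompts`, even though the
    underlying calls overlap in time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


if __name__ == "__main__":
    # Dummy send() for illustration; in practice replace it with an HTTP
    # call such as requests.post(f"{endpoint}/predictions/<model>", data=p).
    results = fan_out(["hello", "world"], send=lambda p: p.upper())
    print(results)  # ['HELLO', 'WORLD']
```

The same pattern works against a SageMaker endpoint by swapping `send` for a `boto3` `invoke_endpoint` call; throughput then depends on the server-side batch size and worker count, not just the client.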