aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Deploy llama2 13b on inf2.24x #59

Closed sayli-ds closed 1 month ago

sayli-ds commented 7 months ago

https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

What is the best way to deploy the above model for fast inference from a local machine, while also supporting parallel requests?

jyang-aws commented 7 months ago

@sayli-ds perhaps you can try SageMaker. Some references:
https://pytorch.org/blog/high-performance-llama/
https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/pytorch_script_mode_local_training_and_serving/pytorch_script_mode_local_training_and_serving.py
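For reference, a minimal sketch of what a SageMaker deployment to inf2 might look like with the SageMaker Python SDK. The image URI, bucket, and artifact path below are placeholders, not values from this thread; a Neuron-compatible inference container and your own packaged model artifacts would need to be substituted.

```python
import sagemaker
from sagemaker.model import Model

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholders: substitute a Neuron-compatible inference DLC image URI
# and the S3 location of your packaged Llama 2 13B model artifacts.
model = Model(
    image_uri="<neuronx-inference-dlc-uri>",
    model_data="s3://<your-bucket>/llama-2-13b/model.tar.gz",
    role=role,
    sagemaker_session=sess,
)

# Deploy to an inf2.24xlarge-backed real-time endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
)
```

A managed endpoint like this also handles concurrent requests, which addresses the parallel-requests part of the question.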

sayli-ds commented 7 months ago

https://github.com/aws/amazon-sagemaker-examples-community/blob/main/torchserve/inf2/llama2/llama-2-13b.ipynb

I could create an endpoint as above for the Llama 2 13B base model, but it gives a timeout error on the primary container for the 13B chat model.
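One thing that can produce exactly this symptom is the endpoint's startup health check firing before a large Neuron model has finished loading. A hedged sketch of raising that limit via the SageMaker Python SDK's `deploy()` call (the value is illustrative, not taken from this thread):

```python
# Give the container more time to load the 13B model before the
# startup health check fails the endpoint (value in seconds).
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    container_startup_health_check_timeout=1800,
)
```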

For the above, I created the Neuron artifacts for the 13B chat model using this: https://github.com/pytorch/serve/blob/master/examples/large_models/inferentia2/llama2/Readme.md?plain=1#L56. I could start TorchServe and run inference via a curl command there, so the model artifacts look okay. But the same artifacts won't work with the first notebook linked above.
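For completeness, the local TorchServe sanity check described above can also be scripted. A minimal sketch, where the model name is a placeholder for whatever name was passed to `torch-model-archiver`:

```python
import requests

# TorchServe's inference API listens on port 8080 by default;
# the model name in the URL is hypothetical.
url = "http://localhost:8080/predictions/llama-2-13b-chat"
resp = requests.post(url, data="What is AWS Inferentia2?", timeout=300)
print(resp.status_code, resp.text)
```

If this succeeds locally but the same artifacts fail on the endpoint, the artifacts themselves are likely fine and the difference lies in the serving container or endpoint configuration.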

aws-taylor commented 1 month ago

Hello @sayli-ds,

This appears to be an issue with SageMaker and/or the community example model. I see you've already reached out to the community (https://github.com/aws/amazon-sagemaker-examples-community/issues/9). You may also want to ask the SageMaker community about how to debug models here: https://repost.aws/search/content?globalSearch=sagemaker.

I'm going to resolve this for now since it does not appear to be an Inferentia issue, but feel free to re-open if you need further assistance.