Closed by sayli-ds 1 month ago
@sayli-ds Perhaps you can try SageMaker. Some references: https://pytorch.org/blog/high-performance-llama/ https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/pytorch_script_mode_local_training_and_serving/pytorch_script_mode_local_training_and_serving.py
I could create an endpoint as above for the Llama 2 13B base model, but for the 13B chat model it gives a timeout error on the container primary.
For the above, I created the Neuron artifacts for the 13B chat model using this guide: https://github.com/pytorch/serve/blob/master/examples/large_models/inferentia2/llama2/Readme.md?plain=1#L56 I could start TorchServe and run inference via the curl command there, so the model artifacts look okay. But the same artifacts don't work with the first notebook reference link.
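On the timeout: for large models, SageMaker's default startup health-check window is often too short for the container to finish loading the checkpoint, so the endpoint fails on the primary container's ping check before the model is ready. A minimal sketch of deploy arguments that extend that window, assuming the SageMaker Python SDK's `container_startup_health_check_timeout` parameter; the instance type and timeout value are placeholder assumptions, not values from this thread:

```python
import os

# Hypothetical deploy kwargs; instance type and timeout are placeholder
# assumptions, not values taken from this thread.
deploy_kwargs = dict(
    initial_instance_count=1,
    instance_type="ml.inf2.8xlarge",              # Inferentia2 instance (assumption)
    container_startup_health_check_timeout=1800,  # seconds SageMaker waits for the
                                                  # primary container to pass its
                                                  # ping health check
)

if os.environ.get("RUN_DEPLOY"):
    # Requires AWS credentials and a `model` object built elsewhere
    # (e.g. sagemaker.pytorch.PyTorchModel); guarded so the sketch
    # stays runnable offline.
    predictor = model.deploy(**deploy_kwargs)  # noqa: F821
else:
    print(deploy_kwargs["container_startup_health_check_timeout"])
```

Whether 1800 seconds is enough depends on model size and instance type; the point is that the default is usually far smaller than a 13B model's load time.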
Hello @sayli-ds,
This appears to be an issue with SageMaker and/or the community example model. I see you've already reached out to the community (https://github.com/aws/amazon-sagemaker-examples-community/issues/9). You may also want to ask the SageMaker community for guidance on debugging models here: https://repost.aws/search/content?globalSearch=sagemaker.
I'm going to resolve this for now since it does not appear to be an Inferentia issue, but feel free to re-open if you need further assistance.
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
What are the best ways to deploy the above model for fast inference from a local machine while also supporting parallel requests?
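On the parallel-requests part of the question: TorchServe batches requests on the server side, so the client only needs to fan requests out concurrently. A minimal sketch using a thread pool, with a dummy `send` function standing in for the real HTTP call (e.g. a `requests.post` to the endpoint's predictions URL, which is an assumption, not code from this thread):

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompts, send, max_workers=8):
    """Send many inference requests concurrently.

    Results come back in the same order as `prompts`, even though the
    underlying calls overlap in time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


if __name__ == "__main__":
    # Dummy send() for illustration; in practice replace it with an HTTP
    # call such as requests.post(f"{endpoint}/predictions/<model>", data=p).
    results = fan_out(["hello", "world"], send=lambda p: p.upper())
    print(results)  # ['HELLO', 'WORLD']
```

The same pattern works against a SageMaker endpoint by swapping `send` for a `boto3` `invoke_endpoint` call; throughput then depends on the server-side batch size and worker count, not just the client.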