sayli-ds opened 12 months ago
I could create an endpoint as described for the Llama 2 13b base model, but for the 13b chat model endpoint creation fails with a timeout error on the primary container.
For the above, I created the Neuron artifacts for the 13b chat model by following https://github.com/pytorch/serve/blob/master/examples/large_models/inferentia2/llama2/Readme.md?plain=1#L56. I could start TorchServe and run inference via a curl command there, so the model artifacts look okay. But the same artifacts won't work with the notebook in reference [1].
[1]https://github.com/aws/amazon-sagemaker-examples-community/blob/main/torchserve/inf2/llama2/llama-2-13b.ipynb
I was able to run the notebook successfully for Llama 2 13b base.
I then followed the instructions to create artifacts for the Llama 2 13b chat model, saved them under model_store/.., and ran inference to test.
When using notebook [1] to run the 13b chat model instead of the base model, I get a timeout error: `ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.us-west-2.amazonaws.com/endpoints/.."`
All other params are the same except tp_degree: I used tp_degree=12 instead of 6 to utilize all 12 NeuronCores on an inf2.24xlarge.
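For reference, the tp_degree change lives in TorchServe's model-config.yaml. A sketch of the relevant section, with key names following the pytorch/serve inf2 llama2 example; the paths and other values here are assumptions, not my actual config:

```yaml
# model-config.yaml -- sketch; keys follow the pytorch/serve inf2 llama2 example
minWorkers: 1
maxWorkers: 1
responseTimeout: 10800        # backend response timeout, in seconds
handler:
    model_checkpoint_dir: "llama-2-13b-chat-split"  # placeholder path
    tp_degree: 12             # changed from 6 to use all NeuronCores on inf2.24xlarge
    max_length: 100
```

Since the compiled Neuron artifacts encode the tensor-parallel degree, tp_degree here must match the value used when tracing the model, or the workers can hang at load time.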