aws / amazon-sagemaker-examples-community

MIT No Attribution

Hosting llama2 13b chat model using torch serve on inf2 #9

Open sayli-ds opened 12 months ago

sayli-ds commented 12 months ago

[1]https://github.com/aws/amazon-sagemaker-examples-community/blob/main/torchserve/inf2/llama2/llama-2-13b.ipynb

I was able to run the notebook successfully for the Llama 2 13B base model.

I followed the instructions to create artifacts for the Llama 2 13B chat model, saved them under model_store/.., and ran inference to test.

However, when I use the notebook [1] to run the 13B chat model instead of the base model, I get a timeout error: ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.us-west-2.amazonaws.com/endpoints/.."

All other parameters are the same except tp_degree: I used tp_degree=12 instead of 6 to utilize all Neuron cores on an inf2.24xlarge.
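For reference, tp_degree lives in the model's config file in the linked TorchServe example, alongside responseTimeout (a real TorchServe setting that also matters for slow model loads). A rough sketch of the relevant fields; the exact keys and values below are from memory and should be verified against the linked repo:

```yaml
# model-config.yaml (sketch; verify field names against the TorchServe example)
minWorkers: 1
maxWorkers: 1
responseTimeout: 10800   # seconds; raise if the chat model loads slowly
handler:
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 12        # 12 cores on inf2.24xlarge (vs 6 in the base example)
    max_length: 100
```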

sayli-ds commented 12 months ago

https://github.com/aws/amazon-sagemaker-examples-community/blob/main/torchserve/inf2/llama2/llama-2-13b.ipynb

I could create an endpoint as above for the Llama 2 13B base model, but for the 13B chat model it gives a timeout error on the primary container.

For the above, I created the Neuron artifacts for the 13B chat model using this guide: https://github.com/pytorch/serve/blob/master/examples/large_models/inferentia2/llama2/Readme.md?plain=1#L56. There, I could start TorchServe and run inference via the curl command, so the model artifacts look okay. But the same artifacts won't work in the first notebook reference [1].
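For that local TorchServe sanity check, the curl call can also be scripted so the same test is easy to rerun. A minimal sketch, assuming TorchServe's default inference port 8080, a JSON request body, and a hypothetical registered model name "llama-2-13b-chat":

```python
import json
import urllib.request


def prediction_url(model_name: str, host: str = "localhost", port: int = 8080) -> str:
    """TorchServe inference endpoint for a registered model (default port 8080)."""
    return f"http://{host}:{port}/predictions/{model_name}"


def infer(model_name: str, prompt: str) -> str:
    """POST a prompt to a locally running TorchServe instance.

    The JSON payload shape here is an assumption; match whatever the
    model's handler expects (the repo example may use a raw text body).
    """
    req = urllib.request.Request(
        prediction_url(model_name),
        data=json.dumps({"inputs": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # "llama-2-13b-chat" is a hypothetical registered model name.
    print(infer("llama-2-13b-chat", "What is AWS Inferentia2?"))
```

If this works locally but the SageMaker endpoint still times out, the difference is likely in the container startup path (model load time vs. health-check/response timeouts) rather than in the artifacts themselves.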