aws / amazon-sagemaker-examples-community

MIT No Attribution

Hosting llama2 13b chat model using torch serve on inf2 #9

Open sayli-ds opened 12 months ago

sayli-ds commented 12 months ago

[1]https://github.com/aws/amazon-sagemaker-examples-community/blob/main/torchserve/inf2/llama2/llama-2-13b.ipynb

I was able to run the notebook successfully for the Llama 2 13B base model.

I followed the instructions to create artifacts for the Llama 2 13B chat model, saved them under model_store/.., and ran inference to test.

However, when I use the notebook [1] to run the 13B chat model instead of the base model, I get a timeout error: ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.us-west-2.amazonaws.com/endpoints/.."

All other parameters are the same except tp_degree: I used tp_degree=12 instead of 6 to utilize all Neuron cores on an inf2.24xlarge.
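For reference, tp_degree lives in the model's config file in the linked TorchServe example, alongside responseTimeout (a real TorchServe setting that also matters for slow model loads). A rough sketch of the relevant fields; the exact keys and values below are from memory and should be verified against the linked repo:

```yaml
# model-config.yaml (sketch; verify field names against the TorchServe example)
minWorkers: 1
maxWorkers: 1
responseTimeout: 10800   # seconds; raise if the chat model loads slowly
handler:
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 12        # 12 cores on inf2.24xlarge (vs 6 in the base example)
    max_length: 100
```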

sayli-ds commented 12 months ago

https://github.com/aws/amazon-sagemaker-examples-community/blob/main/torchserve/inf2/llama2/llama-2-13b.ipynb

I could create an endpoint as above for the Llama 2 13B base model, but for the 13B chat model it gives a timeout error on the primary container.

For the above, I created the Neuron artifacts for the 13B chat model using this guide: https://github.com/pytorch/serve/blob/master/examples/large_models/inferentia2/llama2/Readme.md?plain=1#L56. There, I could start TorchServe and run inference via the curl command, so the model artifacts look okay. But the same artifacts won't work in the first notebook reference [1].
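For that local TorchServe sanity check, the curl call can also be scripted so the same test is easy to rerun. A minimal sketch, assuming TorchServe's default inference port 8080, a JSON request body, and a hypothetical registered model name "llama-2-13b-chat":

```python
import json
import urllib.request


def prediction_url(model_name: str, host: str = "localhost", port: int = 8080) -> str:
    """TorchServe inference endpoint for a registered model (default port 8080)."""
    return f"http://{host}:{port}/predictions/{model_name}"


def infer(model_name: str, prompt: str) -> str:
    """POST a prompt to a locally running TorchServe instance.

    The JSON payload shape here is an assumption; match whatever the
    model's handler expects (the repo example may use a raw text body).
    """
    req = urllib.request.Request(
        prediction_url(model_name),
        data=json.dumps({"inputs": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # "llama-2-13b-chat" is a hypothetical registered model name.
    print(infer("llama-2-13b-chat", "What is AWS Inferentia2?"))
```

If this works locally but the SageMaker endpoint still times out, the difference is likely in the container startup path (model load time vs. health-check/response timeouts) rather than in the artifacts themselves.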