aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Llama2 inferentia]: runtime error when invoking endpoint through boto3 #4549

Open krokoko opened 5 months ago

krokoko commented 5 months ago

Link to the notebook
https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/aws-trainium-inferentia-finetuning-deployment/llama-2-trainium-inferentia-finetuning-deployment.ipynb

Describe the bug
Using a Lambda function with boto3 to query the neuron Llama 2 7B f model deployed on an ml.inf2.xlarge instance, the InvokeEndpoint operation fails with the following message:

{
  "errorMessage": "An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message \"{\n  \"code\": 400,\n  \"type\": \"BadRequestException\",\n  \"message\": \"Parameter model_name is required.\"\n}\n\". See https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logEventViewer:group=/aws/sagemaker/Endpoints/testllamaneuron in account XXXXXXX for more information.",
  "errorType": "ModelError",
  "requestId": "2f2a7aa4-9eeb-42f5-9a14-6285894581bb",
  "stackTrace": [
    "  File \"/var/task/lambda.py\", line 19, in handler\n    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,\n",
    "  File \"/var/runtime/botocore/client.py\", line 530, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n",
    "  File \"/var/runtime/botocore/client.py\", line 960, in _make_api_call\n    raise error_class(parsed_response, operation_name)\n"
  ]
}
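The error message points at the endpoint's CloudWatch log group. A minimal sketch for pulling those container logs with boto3, with the log group name and region taken from the error above:

import boto3

# Fetch recent events from the endpoint's log group referenced in the error.
logs = boto3.client('logs', region_name='us-east-2')
resp = logs.filter_log_events(
    logGroupName='/aws/sagemaker/Endpoints/testllamaneuron',
    limit=50,
)
for event in resp['events']:
    print(event['message'])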

The model configuration is as follows:
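For context, a minimal sketch of how the endpoint is presumably deployed, following the linked notebook; the JumpStart model_id below is an assumption inferred from the "neuron llama2 7b f" description and should be checked against the notebook:

from sagemaker.jumpstart.model import JumpStartModel

# Assumed model_id for the Neuron-compiled Llama 2 7B chat ("f") model.
model = JumpStartModel(model_id="meta-textgenerationneuron-llama-2-7b-f")
predictor = model.deploy(
    accept_eula=True,                  # required by the Llama 2 license
    instance_type="ml.inf2.xlarge",    # instance type from the bug description
    endpoint_name="testllamaneuron",
)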

To reproduce

import boto3
import json

def handler(event, context):
    # 'runtime.sagemaker' is the legacy boto3 name for the SageMaker runtime
    # client; 'sagemaker-runtime' is the current alias for the same service.
    runtime = boto3.client('runtime.sagemaker')

    ENDPOINT_NAME = 'testllamaneuron'

    # Chat-style payload in the Llama 2 dialog format.
    dic = {
        "inputs": [
            [
                {"role": "system", "content": "You are chat bot who writes songs"},
                {"role": "user", "content": "Write a rap song about Amazon Web Services"}
            ]
        ],
        "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
    }

    # accept_eula must be passed as a custom attribute for the Llama 2 models.
    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='application/json',
                                       Body=json.dumps(dic),
                                       CustomAttributes="accept_eula=true")

    result = json.loads(response['Body'].read().decode())
    print(result)

    return {
        "statusCode": 200,
        "body": json.dumps(result)
    }
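As a cross-check (not part of the original report), the same endpoint can be invoked through the SageMaker Python SDK, which helps rule out request serialization on the boto3 side. A hedged sketch, assuming a recent sagemaker SDK that supports custom_attributes on predict():

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Attach to the existing endpoint by name and let the SDK handle JSON
# (de)serialization; custom_attributes carries the EULA flag as above.
predictor = Predictor(
    endpoint_name="testllamaneuron",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
payload = {
    "inputs": [[
        {"role": "user", "content": "Write a rap song about Amazon Web Services"},
    ]],
    "parameters": {"max_new_tokens": 256},
}
result = predictor.predict(payload, custom_attributes="accept_eula=true")
print(result)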

Logs

Lambda Function logs:

[ERROR] ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "BadRequestException",
  "message": "Parameter model_name is required."
}