deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

DJL-TRTLLM: Error while detokenizing output response of teknium/OpenHermes-2.5-Mistral-7B on Sagemaker #1792

Open omarelshehy opened 4 months ago

omarelshehy commented 4 months ago

Description

I followed the recipe given here to manually convert teknium/OpenHermes-2.5-Mistral-7B to TensorRT on SageMaker's ml.g5.4xlarge, and to deploy the compiled model saved on S3 to a SageMaker endpoint on ml.g5.2xlarge (only CPU and RAM differ between the two instance types). When I invoke the endpoint simply using

import boto3
import json 

runtime = boto3.client("sagemaker-runtime")

endpoint_name = "djl-trtllm-endpoint"
content_type = "application/json"
payload = json.dumps({"inputs": "hey", "parameters": {}})

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload)
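(For completeness, the response body would then be read back roughly as follows; this is a minimal sketch, and the exact JSON fields returned depend on the container and are not shown in this issue.)

# Parse the response; boto3 returns the body as a StreamingBody.
result = json.loads(response["Body"].read().decode("utf-8"))
print(result)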

I receive the following error log:

Error Message

[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:Rolling batch inference error
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:Traceback (most recent call last):
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:  File "/tmp/.djl.ai/python/0.26.0/djl_python/rolling_batch/rolling_batch.py", line 189, in try_catch_handling
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:    return func(self, input_data, parameters)
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:  File "/tmp/.djl.ai/python/0.26.0/djl_python/rolling_batch/trtllm_rolling_batch.py", line 80, in inference
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:    generation = trt_resp.fetch()
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/detoknized_triton_repsonse.py", line 69, in fetch
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:    self.decode_token(), len(self.all_input_ids), complete)
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/detoknized_triton_repsonse.py", line 45, in decode_token
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:    new_text = self.tokenizer.decode(
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3750, in decode
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:    return self._decode(
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:TypeError: argument 'ids': 'list' object cannot be interpreted as an integer

I assume the error comes from passing a list of lists to the _tokenizer.decode function instead of a flat list of input_ids. Can someone help me understand why this happens?
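For illustration, the same TypeError can be reproduced directly against the Hugging Face fast tokenizer by passing a batch-shaped (nested) list where a flat list of token ids is expected (a minimal sketch; loading the tokenizer like this is only illustrative and not the exact code path inside the container):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

flat_ids = tokenizer.encode("hey")   # flat list of ints
print(tokenizer.decode(flat_ids))    # works: returns the decoded string

nested_ids = [flat_ids]              # batch-shaped list of lists
tokenizer.decode(nested_ids)         # TypeError: argument 'ids': 'list' object
                                     #            cannot be interpreted as an integer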

lanking520 commented 3 months ago

Could you share which DJLServing or LMI version you are using?