NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

TensorRT-LLM Conversion Script Bug: TypeError: Got unsupported ScalarType BFloat16 #1524

Open rileyhun opened 5 months ago

rileyhun commented 5 months ago

System Info

Description

I'm building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying the Zephyr-7B model on Sagemaker Endpoints. Unfortunately, I run into an error from the tensorrt_llm_toolkit: TypeError: Got unsupported ScalarType BFloat16. It seems like this is most likely an error with the checkpoint conversion script in NVIDIA/TensorRT-LLM, since it loads the weights directly and converts them to NumPy on the CPU, and NumPy has no native BFloat16 dtype.
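For context, the underlying failure can be reproduced outside of DJL-Serving or SageMaker, because PyTorch refuses to hand a bfloat16 tensor to NumPy. A minimal sketch (the variable name just mirrors the traceback below):

```python
import torch

# NumPy has no bfloat16 dtype, so calling .numpy() on a bfloat16 tensor
# raises the same error the conversion script hits, even on CPU.
lm_head_weights = torch.zeros(4, 4, dtype=torch.bfloat16)
try:
    lm_head_weights.detach().cpu().numpy()
except TypeError as e:
    print(e)  # Got unsupported ScalarType BFloat16
```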

System Info:
- Instance: ml.g5.48xlarge (8 A10 GPUs, Sagemaker endpoint)
- OS: Ubuntu 22.04 LTS
- Model: Zephyr-7B Beta

Who can help?

No response

Information

Tasks

Reproduction

- Create model:

```python
model_name = name_from_base(f"my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16",
        },
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")
```

- Create endpoint config:

```python
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response
```

- Create sagemaker endpoint:

```python
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}",
    EndpointConfigName=endpoint_config_name,
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
```
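Not part of the original steps, but since the checkpoint conversion runs at container startup, the error only surfaces after the endpoint is created. A small polling sketch using plain boto3 calls (assumed helper, not from the repro) shows how the failure appears on the SageMaker side:

```python
import time

# Poll until the endpoint either comes up or fails; a conversion error
# shows up here as EndpointStatus == "Failed" with a FailureReason.
while True:
    desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = desc["EndpointStatus"]
    print(f"Endpoint status: {status}")
    if status in ("InService", "Failed"):
        print(desc.get("FailureReason", ""))
        break
    time.sleep(60)
```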


### Expected behavior
Expected the DJL-Serving Image derived from here (https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile) to run successfully on Sagemaker Endpoints.

*IMPORTANT:* An older version of the DJL-Serving TensorRT-LLM container works. 
These are the args I used to get it working:

```dockerfile
ARG djl_version=0.27.0~SNAPSHOT

# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2

# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0

# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1
```


### actual behavior

```
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00 [INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
```



### additional notes

N/A
byshiue commented 4 months ago

We should use a self-defined torch_to_numpy() instead of .numpy() to convert a torch tensor to a numpy array, which avoids this issue with bfloat16. This should be fixed on the latest main branch. Could you take a look?
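For readers hitting the same error on older branches, the gist of such a helper is to avoid calling .numpy() directly on a bfloat16 tensor. A rough sketch only; this is an assumed stand-in, not the actual torch_to_numpy() in tensorrt_llm._utils:

```python
import numpy as np
import torch

def torch_to_numpy_sketch(t: torch.Tensor) -> np.ndarray:
    """Hypothetical stand-in for a torch_to_numpy()-style helper.

    NumPy has no bfloat16 dtype, so reinterpret the raw 16-bit payload
    instead of calling .numpy() on the bfloat16 tensor directly.
    """
    t = t.detach().cpu()
    if t.dtype == torch.bfloat16:
        # The returned array holds bfloat16 bit patterns as uint16;
        # callers must keep track of the logical dtype themselves.
        return t.view(torch.int16).numpy().view(np.uint16)
    return t.numpy()
```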