NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

TensorRT-LLM Conversion Script Bug: TypeError: Got unsupported ScalarType BFloat16 #1524

Open rileyhun opened 6 months ago

rileyhun commented 6 months ago

System Info

Description

I am building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying it on SageMaker Endpoints for the Zephyr-7B model. Unfortunately, I run into an error from the tensorrt_llm_toolkit: TypeError: Got unsupported ScalarType BFloat16. This is most likely an error in the checkpoint conversion script in NVIDIA/TensorRT-LLM, since it loads the weights directly and converts them to NumPy on the CPU, and NumPy has no native bfloat16 dtype.
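The failure can be reproduced outside of SageMaker and the toolkit with a two-line sanity check, since PyTorch raises exactly this error when asked to convert a bfloat16 tensor to NumPy (minimal sketch, independent of the conversion script):

```python
import torch

# NumPy has no bfloat16 dtype, so this raises:
# TypeError: Got unsupported ScalarType BFloat16
t = torch.zeros(4, dtype=torch.bfloat16)
t.numpy()
```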

System Info:
- Instance: ml.g5.48xlarge (8 NVIDIA A10G GPUs, SageMaker endpoint)
- OS: Ubuntu 22.04 LTS
- Model: Zephyr-7B Beta

Who can help?

No response

Information

Tasks

Reproduction

- Create model:

```python
model_name = name_from_base(f"my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16",
        },
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")
```

- Create endpoint config:

```python
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response
```

- Create sagemaker endpoint:

```python
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}",
    EndpointConfigName=endpoint_config_name,
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
```
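For completeness (this is a standard boto3 waiter, not part of the original repro steps): the endpoint status can be polled after creation. In this case the deployment fails while the container converts the checkpoint, and the traceback quoted under "Actual behavior" below shows up in the CloudWatch logs.

```python
# Blocks until the endpoint is InService (raises WaiterError if deployment fails).
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint status: {status}")
```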


### Expected behavior
Expected the DJL-Serving image built from this Dockerfile (https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile) to run successfully on SageMaker Endpoints.

*IMPORTANT:* An older version of the DJL-Serving TensorRT-LLM container works.
These are the Dockerfile build args I used to get it working:

```dockerfile
ARG djl_version=0.27.0~SNAPSHOT

# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2

# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0

# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1
```


### Actual behavior

```
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00 [INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
```



### Additional notes

N/A
byshiue commented 5 months ago

We should use a self-defined torch_to_numpy() instead of .numpy() to convert a torch tensor to a numpy array, which prevents this issue under bfloat16. This should be fixed on the latest main branch. Could you take a look?
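For readers stuck on an older build, a minimal sketch of the general idea behind such a helper is shown below. This is not TensorRT-LLM's actual torch_to_numpy implementation; the optional ml_dtypes dependency and the float32 fallback are assumptions for illustration only.

```python
import numpy as np
import torch

try:
    import ml_dtypes  # optional: provides a NumPy-compatible bfloat16 dtype
    _NP_BF16 = np.dtype(ml_dtypes.bfloat16)
except ImportError:
    _NP_BF16 = None


def torch_to_numpy(t: torch.Tensor) -> np.ndarray:
    """Convert a torch tensor to NumPy, working around NumPy's missing bfloat16."""
    t = t.detach().cpu()
    if t.dtype == torch.bfloat16:
        if _NP_BF16 is not None:
            # Reinterpret the raw 16-bit payload as a NumPy bfloat16 array (bitwise view).
            return t.view(torch.int16).numpy().view(_NP_BF16)
        # Fallback: upcast to float32 (values preserved, dtype widened).
        return t.to(torch.float32).numpy()
    return t.numpy()
```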