Open rileyhun opened 4 months ago
Hi Riley, thanks for raising the issue. This is most likely an error in the checkpoint conversion script in NVIDIA/TensorRT-LLM: it loads the weights directly and converts them to numpy on the CPU, while BFloat16 is a GPU-oriented type that numpy does not support. I'd suggest creating a ticket in the TensorRT-LLM repo about this issue.
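For reference, the failure can be reproduced in isolation: PyTorch refuses to convert a bfloat16 tensor to numpy, which is presumably (assuming the toolkit takes that code path) where the `TypeError` originates. A minimal sketch:

```python
import torch

# numpy has no bfloat16 dtype, so calling .numpy() on a bf16 tensor raises
# "TypeError: Got unsupported ScalarType BFloat16".
t = torch.zeros(4, dtype=torch.bfloat16)
try:
    t.numpy()
except TypeError as e:
    print(e)

# Casting to fp32 first works, since numpy supports float32.
arr = t.float().numpy()
print(arr.dtype)  # float32
```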
To work around this issue in the meantime, you could manually convert the model to fp32 and save it before loading it.
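One way to do that conversion (a hedged sketch, not the toolkit's own code path — the checkpoint filenames here are placeholders) is to cast any bfloat16 tensors in the checkpoint's state dict to float32 and re-save it before pointing the conversion script at it:

```python
import torch

def cast_bf16_to_fp32(state_dict):
    """Return a copy of a state dict with all bfloat16 tensors cast to float32."""
    return {
        name: tensor.float() if tensor.dtype == torch.bfloat16 else tensor
        for name, tensor in state_dict.items()
    }

# Hypothetical usage with a single-file PyTorch checkpoint; with Hugging Face
# transformers you could instead load the model with
# torch_dtype=torch.float32 and call save_pretrained().
# sd = torch.load("pytorch_model.bin", map_location="cpu")
# torch.save(cast_bf16_to_fp32(sd), "pytorch_model_fp32.bin")
```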
Hello @ydm-amazon,
Thanks for following up. I'll check w/ the TensorRT-LLM repo about the issue.
Also wanted to point out that I don't get this issue when using the following args in the Dockerfile:
```dockerfile
ARG djl_version=0.27.0~SNAPSHOT

# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2

# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0

# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1
```
That's right - we know that TensorRT-LLM switched to a different way of loading the model from 0.7.1 to 0.8.0, so that may have caused the issue. We're also looking into our trtllm toolkit 0.8.0 to see if there's something there that may also contribute to the issue.
Description
I am building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying it on SageMaker Endpoints for the Zephyr-7B model. Unfortunately, I run into an error from the `tensorrt_llm_toolkit`: `TypeError: Got unsupported ScalarType BFloat16`
Expected Behavior
Expected the DJL-Serving image derived from https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile to run successfully on SageMaker Endpoints.
Error Message
TypeError: Got unsupported ScalarType BFloat16
How to Reproduce?
```python
model_name = name_from_base("my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16",
        },
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
```