Open · rileyhun opened this issue 6 months ago
We should use a self-defined torch_to_numpy() instead of .numpy() to convert a torch tensor to a numpy array, which avoids this issue under bfloat16. This issue should be fixed in the latest main branch. Could you take a look?
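For reference, a minimal sketch of what such a helper could look like (the actual implementation on main may differ; upcasting to float32 is one safe route, since every bf16 value is exactly representable in fp32):

import numpy as np
import torch

def torch_to_numpy(t: torch.Tensor) -> np.ndarray:
    # numpy has no native BFloat16 dtype, so tensor.numpy() raises
    # "TypeError: Got unsupported ScalarType BFloat16" for bf16 tensors.
    # Upcast to float32 first; this is lossless for bf16 values.
    if t.dtype == torch.bfloat16:
        t = t.to(torch.float32)
    return t.detach().cpu().numpy()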
System Info

GPU: ml.g5.48xlarge (8 A10G GPUs, SageMaker Endpoints)
OS: Ubuntu 22.04 LTS
Model: Zephyr-7B Beta

Description

I am building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying the Zephyr-7B model on SageMaker Endpoints. Unfortunately, I run into an error from the tensorrt_llm_toolkit: TypeError: Got unsupported ScalarType BFloat16. This is most likely a bug in the checkpoint conversion script in NVIDIA/TensorRT-LLM: it loads the weights and converts them to numpy on the CPU, but numpy has no BFloat16 dtype, so .numpy() fails on bf16 tensors.
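The failure is reproducible outside the toolkit in two lines, since torch refuses to hand a bf16 tensor to numpy:

import torch

t = torch.zeros(2, dtype=torch.bfloat16)
t.numpy()  # raises TypeError: Got unsupported ScalarType BFloat16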
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
model_name = name_from_base("my-model-djl-tensorrt")
print(model_name)
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16",
        },
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
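For completeness, once the endpoint is InService it can be smoke-tested as below. The prompt and the {"inputs": ..., "parameters": ...} payload shape are assumptions based on the default LMI handler implied by OPTION_OUTPUT_FORMATTER=json; sm_client and endpoint_name come from the cells above.

import json
import boto3

# Block until the endpoint finishes deploying.
sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

smr_client = boto3.client("sagemaker-runtime")
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    # Hypothetical payload; adjust to your handler's expected schema.
    Body=json.dumps({"inputs": "Hello", "parameters": {"max_new_tokens": 64}}),
)
print(response["Body"].read().decode("utf-8"))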
ARG djl_version=0.27.0~SNAPSHOT

# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2

# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0

# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00 [INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
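The traceback pins the failure to the .numpy() call inside np.pad. A standalone illustration of the break and the style of fix (the tensor and pad widths are illustrative, since the log truncates the real arguments):

import numpy as np
import torch

lm_head_weights = torch.randn(4, 8, dtype=torch.bfloat16)  # stand-in tensor

# Fails as in the log above: numpy cannot represent BFloat16.
# np.pad(lm_head_weights.detach().cpu().numpy(), ((0, 2), (0, 0)))

# Works: route through a bf16-safe conversion first, as the maintainer
# comment suggests (here, a lossless upcast to float32).
padded = np.pad(
    lm_head_weights.detach().cpu().to(torch.float32).numpy(),
    ((0, 2), (0, 0)),
)
print(padded.shape, padded.dtype)  # (6, 8) float32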