YunChen1227 opened this issue 5 months ago
It looks like this is caused by a mismatch between the TRT-LLM version of the example code and the TRT-LLM core. gpu_weights_percent was not yet added in TRT-LLM 0.9.0, so you might be using newer example code while running it on TRT-LLM v0.9.0.
Please try installing the latest main branch.
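If it helps to confirm the mismatch, here is a small diagnostic sketch (just an illustration, not an official check) that prints the installed TRT-LLM version and whether its ModelRunnerCpp.from_dir() accepts gpu_weights_percent:

# Diagnostic sketch: print the installed TRT-LLM version and check whether its
# ModelRunnerCpp.from_dir() accepts gpu_weights_percent. On v0.9.0 it should not,
# which would match the error reported below.
import inspect

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp

print("TensorRT-LLM version:", tensorrt_llm.__version__)
params = inspect.signature(ModelRunnerCpp.from_dir).parameters
print("from_dir accepts gpu_weights_percent:", "gpu_weights_percent" in params)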
I checked the ModelRunnerCpp.py file; gpu_weights_percent is a parameter of the function with a default value of 1.
Anyway, this is not the problem I want to solve now. I tried to deploy the engine converted by TRT-LLM v0.9.0 using Triton Server, but it always fails. Could you please help me solve this problem? Below is the error I ran into. I just followed the QuickStart guide, and no matter which version of Triton Server I used, the problem remained.
docker run -it --rm --gpus all --network host --shm-size=1g \
  -v $(pwd)/all_models:/all_models \
  -v $(pwd)/scripts:/opt/scripts \
  nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
huggingface-cli login --token *****
pip install sentencepiece protobuf
python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 2
E0530 02:04:50.281196 2894 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1 0x7fd9982614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fd9982850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fd9982850a0]
3 0x7fd99a0cb572 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr
nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 only installs TRT-LLM v0.8.0. You cannot use it to serve an engine built by TRT-LLM v0.9.0.
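If you are unsure which version built a given engine, a quick way to check (a sketch that assumes the engine's config.json carries a top-level "version" field, as recent TRT-LLM releases write) is:

# Sketch: read the TRT-LLM version recorded in an engine directory's config.json.
# Assumes a top-level "version" field, which recent TRT-LLM releases write; older
# engines may not record it, in which case this prints "unknown".
import json
from pathlib import Path

def engine_trtllm_version(engine_dir: str) -> str:
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    return config.get("version", "unknown")

# Example path reused from the run.py command later in this thread.
print(engine_trtllm_version("/models/tmp/llama/7B/trt_engines/fp16/2-gpu/"))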
I tried converting the checkpoint and building the engine with TensorRT-LLM v0.8.0 and deploying it with the 24.02 container, but below is what I got:
root@ccnl06:/cognitive_comp/chenyun/tensorrtllm_backend# docker run -it --rm --gpus all --network host --shm-size=40g \
  -v $(pwd)/all_models:/all_models \
  -v $(pwd)/scripts:/opt/scripts \
  -v /cognitive_comp/chenyun/models:/cognitive_comp/chenyun/models \
  nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
NVIDIA Release 24.02 (build 83572707) Triton Server Version 2.43.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Would you please tell me what the problem is?
The log says "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used." Running pip install torch==2.2 did not solve this problem.
The latest examples/run.py (#1688) still passes "gpu_weights_percent", so it causes an error when running an engine built by TRT-LLM 0.9.0. A quick solution is to remove the line that passes that argument (line 430) and also remove the other new arguments (lines 445-449); then you can run the engine. I can run a Llama 3 engine built by TRT-LLM 0.9.0 this way.
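As a version-tolerant alternative to hand-editing those line numbers, one option (only a sketch, not the change proposed above) is to drop any keyword arguments that the installed ModelRunnerCpp.from_dir() does not accept before calling it; from_dir_compat below is a hypothetical helper, and runner_kwargs stands in for the dict run.py builds:

# Sketch of a hypothetical helper that drops kwargs the installed TRT-LLM's
# ModelRunnerCpp.from_dir() does not accept (e.g. gpu_weights_percent on v0.9.0),
# instead of deleting lines from run.py by hand.
import inspect

def from_dir_compat(runner_cls, **runner_kwargs):
    accepted = inspect.signature(runner_cls.from_dir).parameters
    dropped = sorted(k for k in runner_kwargs if k not in accepted)
    if dropped:
        print(f"Dropping kwargs unsupported by this TRT-LLM version: {dropped}")
    return runner_cls.from_dir(**{k: v for k, v in runner_kwargs.items() if k in accepted})

# In run.py, the call "runner = runner_cls.from_dir(**runner_kwargs)" would become:
# runner = from_dir_compat(runner_cls, **runner_kwargs)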
Thanks for the instructions. Have you tried deploying the engine with Triton Server? Do you have any suggestions for that?
Yes, I can deploy the engine successfully with Triton. You need to modify the config.pbtxt files as described in the instructions here: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/
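For reference, the config.pbtxt templates in tensorrtllm_backend use ${...} placeholders that are normally filled with tools/fill_template.py; the sketch below does the same substitution in plain Python. The placeholder names shown are assumptions and need to match the ones in your template version.

# Sketch: fill ${...} placeholders in a config.pbtxt template, mimicking what
# tensorrtllm_backend's tools/fill_template.py does. The placeholder names below
# are examples only; check your template for the exact set it expects.
from pathlib import Path
from string import Template

def fill_config(path: str, values: dict) -> None:
    text = Path(path).read_text()
    Path(path).write_text(Template(text).safe_substitute(values))

fill_config(
    "all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt",
    {
        "triton_max_batch_size": "64",   # assumed placeholder name
        "decoupled_mode": "False",       # assumed placeholder name
        "engine_dir": "/all_models/inflight_batcher_llm/tensorrt_llm/1",  # assumed placeholder name
    },
)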
Thanks, but it is a bit different for LLaMA 2 13B, which has to use tensor parallelism. Some parameters differ from the instructions and I failed again. Anyway, thank you very much for the advice.
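One pitfall with tensor-parallel engines like that: the --world_size passed to launch_triton_server.py has to match the engine. Below is a sketch for reading it back from the engine's config.json; the pretrained_config.mapping field names follow the unified config layout of recent TRT-LLM releases and are an assumption, and older 0.8.0-style engines record this differently.

# Sketch: derive the world size an engine expects from its config.json, under the
# assumption that it uses the unified layout with pretrained_config.mapping
# (tp_size / pp_size). Older engine formats store this elsewhere.
import json
from pathlib import Path

def engine_world_size(engine_dir: str) -> int:
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    mapping = config.get("pretrained_config", {}).get("mapping", {})
    return int(mapping.get("tp_size", 1)) * int(mapping.get("pp_size", 1))

# Path reused from the run.py command in this thread.
ws = engine_world_size("/models/tmp/llama/7B/trt_engines/fp16/2-gpu/")
print(f"Launch Triton with --world_size {ws}")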
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
Using an RTX 3090 and the Docker image produced by following the QuickStart doc.
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
After building the LLaMA 2 engine:
python3 ../run.py --max_output_len=40 --tokenizer_dir /models/0520/ckpt/0/global_step3900-hf/ --engine_dir /models/tmp/llama/7B/trt_engines/fp16/2-gpu/ --input_text ...
Expected behavior
I expect to get an answer from the model.
Actual behavior
hwloc/linux: Ignoring PCI device with non-16bit domain. Pass --enable-32bits-pci-domain to configure to support such devices (warning: it would break the library ABI, don't enable unless really needed).
hwloc/linux: Ignoring PCI device with non-16bit domain. Pass --enable-32bits-pci-domain to configure to support such devices (warning: it would break the library ABI, don't enable unless really needed).
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/../run.py", line 571, in <module>
    main(args)
  File "/TensorRT-LLM/examples/llama/../run.py", line 420, in main
    runner = runner_cls.from_dir(**runner_kwargs)
TypeError: ModelRunnerCpp.from_dir() got an unexpected keyword argument 'gpu_weights_percent'
Additional notes
The model I converted does not have many differences compared to the original LLaMA 2 13B. Every step before running, i.e. convert_checkpoint.py and trtllm-build, worked perfectly.