YunChen1227 opened this issue 5 months ago
It looks like this is caused by a mismatch between the TRT-LLM version of the example code and the TRT-LLM core. gpu_weights_percent was not yet added in TRT-LLM 0.9.0, so you might be using newer example code while running it on TRT-LLM v0.9.0.
Please try installing the latest main branch.
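If it helps to confirm the mismatch, here is a small diagnostic sketch (just an illustration, not an official check) that prints the installed TRT-LLM version and whether its ModelRunnerCpp.from_dir() accepts gpu_weights_percent:

# Diagnostic sketch: print the installed TRT-LLM version and check whether its
# ModelRunnerCpp.from_dir() accepts gpu_weights_percent. On v0.9.0 it should not,
# which would match the error reported below.
import inspect

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp

print("TensorRT-LLM version:", tensorrt_llm.__version__)
params = inspect.signature(ModelRunnerCpp.from_dir).parameters
print("from_dir accepts gpu_weights_percent:", "gpu_weights_percent" in params)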
I checked the ModelRunnerCpp.py file; gpu_weights_percent is a parameter of the function with a default value of 1.
Anyway, this is not the problem I want to solve now. I tried to deploy the engine converted by TRT-LLM v0.9.0 using Triton Server, but it always fails. Could you please help me solve this problem? Below is the error I ran into. I just followed the QuickStart guide, and no matter which version of Triton Server I used, the problem remained.
docker run -it --rm --gpus all --network host --shm-size=1g \
  -v $(pwd)/all_models:/all_models \
  -v $(pwd)/scripts:/opt/scripts \
  nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
huggingface-cli login --token *****
pip install sentencepiece protobuf
python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 2
E0530 02:04:50.281196 2894 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1 0x7fd9982614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fd9982850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fd9982850a0]
3 0x7fd99a0cb572 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr
nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 only installs TRT-LLM v0.8.0. You cannot use it to serve an engine built by TRT-LLM v0.9.0.
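If you are unsure which version built a given engine, a quick way to check (a sketch that assumes the engine's config.json carries a top-level "version" field, as recent TRT-LLM releases write) is:

# Sketch: read the TRT-LLM version recorded in an engine directory's config.json.
# Assumes a top-level "version" field, which recent TRT-LLM releases write; older
# engines may not record it, in which case this prints "unknown".
import json
from pathlib import Path

def engine_trtllm_version(engine_dir: str) -> str:
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    return config.get("version", "unknown")

# Example path reused from the run.py command later in this thread.
print(engine_trtllm_version("/models/tmp/llama/7B/trt_engines/fp16/2-gpu/"))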
I tried converting the checkpoint and building the engine with TensorRT-LLM v0.8.0 and deploying it with the 24.02 container, but below is what I got:
root@ccnl06:/cognitive_comp/chenyun/tensorrtllm_backend# docker run -it --rm --gpus all --network host --shm-size=40g \
  -v $(pwd)/all_models:/all_models \
  -v $(pwd)/scripts:/opt/scripts \
  -v /cognitive_comp/chenyun/models:/cognitive_comp/chenyun/models \
  nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
NVIDIA Release 24.02 (build 83572707) Triton Server Version 2.43.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Would you please tell me what the problem is?
The log says "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used." Running pip install torch==2.2 did not solve this problem.
The latest examples/run.py (#1688) still passes "gpu_weights_percent", so it causes an error when running an engine built by TRT-LLM 0.9.0. A quick solution is to remove the line that passes that argument (line 430) and also remove the other new arguments (lines 445-449); then you can run the engine. I can run a Llama 3 engine built by TRT-LLM 0.9.0 this way.
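As a version-tolerant alternative to hand-editing those line numbers, one option (only a sketch, not the change proposed above) is to drop any keyword arguments that the installed ModelRunnerCpp.from_dir() does not accept before calling it; from_dir_compat below is a hypothetical helper, and runner_kwargs stands in for the dict run.py builds:

# Sketch of a hypothetical helper that drops kwargs the installed TRT-LLM's
# ModelRunnerCpp.from_dir() does not accept (e.g. gpu_weights_percent on v0.9.0),
# instead of deleting lines from run.py by hand.
import inspect

def from_dir_compat(runner_cls, **runner_kwargs):
    accepted = inspect.signature(runner_cls.from_dir).parameters
    dropped = sorted(k for k in runner_kwargs if k not in accepted)
    if dropped:
        print(f"Dropping kwargs unsupported by this TRT-LLM version: {dropped}")
    return runner_cls.from_dir(**{k: v for k, v in runner_kwargs.items() if k in accepted})

# In run.py, the call "runner = runner_cls.from_dir(**runner_kwargs)" would become:
# runner = from_dir_compat(runner_cls, **runner_kwargs)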
Thanks for the instructions. Have you tried deploying the engine with Triton Server? Do you have any suggestions for that?
Yes, I can deploy the engine successfully with Triton. You need to modify the config.pbtxt files as described in the instructions here: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/
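For reference, the config.pbtxt templates in tensorrtllm_backend use ${...} placeholders that are normally filled with tools/fill_template.py; the sketch below does the same substitution in plain Python. The placeholder names shown are assumptions and need to match the ones in your template version.

# Sketch: fill ${...} placeholders in a config.pbtxt template, mimicking what
# tensorrtllm_backend's tools/fill_template.py does. The placeholder names below
# are examples only; check your template for the exact set it expects.
from pathlib import Path
from string import Template

def fill_config(path: str, values: dict) -> None:
    text = Path(path).read_text()
    Path(path).write_text(Template(text).safe_substitute(values))

fill_config(
    "all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt",
    {
        "triton_max_batch_size": "64",   # assumed placeholder name
        "decoupled_mode": "False",       # assumed placeholder name
        "engine_dir": "/all_models/inflight_batcher_llm/tensorrt_llm/1",  # assumed placeholder name
    },
)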
Thanks, but it is a bit different for LLaMA 2 13B, which has to use tensor parallelism. Some parameters differ from the instructions and I failed again. Anyway, thank you very much for the advice.
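One pitfall with tensor-parallel engines like that: the --world_size passed to launch_triton_server.py has to match the engine. Below is a sketch for reading it back from the engine's config.json; the pretrained_config.mapping field names follow the unified config layout of recent TRT-LLM releases and are an assumption, and older 0.8.0-style engines record this differently.

# Sketch: derive the world size an engine expects from its config.json, under the
# assumption that it uses the unified layout with pretrained_config.mapping
# (tp_size / pp_size). Older engine formats store this elsewhere.
import json
from pathlib import Path

def engine_world_size(engine_dir: str) -> int:
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    mapping = config.get("pretrained_config", {}).get("mapping", {})
    return int(mapping.get("tp_size", 1)) * int(mapping.get("pp_size", 1))

# Path reused from the run.py command in this thread.
ws = engine_world_size("/models/tmp/llama/7B/trt_engines/fp16/2-gpu/")
print(f"Launch Triton with --world_size {ws}")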
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
Using an RTX 3090 and the Docker image produced by following the QuickStart doc.
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
After building the LLaMA 2 engine:
python3 ../run.py --max_output_len=40 --tokenizer_dir /models/0520/ckpt/0/global_step3900-hf/ --engine_dir /models/tmp/llama/7B/trt_engines/fp16/2-gpu/ --input_text ...
Expected behavior
I expect to get an answer from the model.
Actual behavior
hwloc/linux: Ignoring PCI device with non-16bit domain. Pass --enable-32bits-pci-domain to configure to support such devices (warning: it would break the library ABI, don't enable unless really needed).
hwloc/linux: Ignoring PCI device with non-16bit domain. Pass --enable-32bits-pci-domain to configure to support such devices (warning: it would break the library ABI, don't enable unless really needed).
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/../run.py", line 571, in <module>
    main(args)
  File "/TensorRT-LLM/examples/llama/../run.py", line 420, in main
    runner = runner_cls.from_dir(**runner_kwargs)
TypeError: ModelRunnerCpp.from_dir() got an unexpected keyword argument 'gpu_weights_percent'
Additional notes
The model I converted does not have many differences compared to the original LLaMA 2 13B. Every step before running, i.e. convert_checkpoint.py and trtllm-build, worked perfectly.