NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

NCCL reports 'out of memory' when deploying llama3 to Triton on 8xV100 #1670

forrestjgq commented 1 month ago

### System Info

- CPU: x86
- Driver:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB            On | 00000000:00:08.0 Off |                    0 |
| N/A   44C    P0               57W / 300W|  32498MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB            On | 00000000:00:09.0 Off |                    0 |
| N/A   43C    P0               58W / 300W|  19530MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB            On | 00000000:00:0A.0 Off |                    0 |
| N/A   42C    P0               57W / 300W|  19474MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB            On | 00000000:00:0B.0 Off |                    0 |
| N/A   44C    P0               57W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB            On | 00000000:00:0C.0 Off |                    0 |
| N/A   43C    P0               59W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB            On | 00000000:00:0D.0 Off |                    0 |
| N/A   42C    P0               58W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB            On | 00000000:00:0E.0 Off |                    0 |
| N/A   44C    P0               61W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB            On | 00000000:00:0F.0 Off |                    0 |
| N/A   45C    P0               62W / 300W|  21482MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
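To track headroom while reproducing, per-GPU memory can be polled continuously; a minimal sketch using standard nvidia-smi query flags (not part of the original report):

```bash
# Poll per-GPU memory once per second while the server loads.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```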

### Who can help?

@byshiue

### Reproduction

1. Check out TensorRT-LLM and tensorrtllm_backend at v0.9.0, and follow https://developer.nvidia.com/zh-cn/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/
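For reference, a minimal sketch of this step (repository URLs are the public ones; the local layout is an assumption):

```bash
# Hypothetical checkout of the matching v0.9.0 tags.
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
trtllm=$PWD/TensorRT-LLM   # $trtllm as used in the build commands below
```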

2. Build the Llama 3 engine in an nvidia/cuda:12.1.0-devel-ubuntu22.04 container:

model=/home/jgq/cloud/models/llama3-70b-instruct
checkpoint=/home/jgq/cloud/engines/llama3-70b-instruct/cvt-tp
output=/home/jgq/cloud/engines/llama3-70b-instruct/engines-tp
rm -rf $checkpoint && mkdir $checkpoint
rm -rf $output && mkdir $output
tp=2
pp=4
dtype=float16

cd $trtllm/examples/llama
python3 convert_checkpoint.py --model_dir $model \
    --output_dir $checkpoint \
    --tp_size $tp \
    --pp_size $pp \
    --dtype $dtype

trtllm-build --checkpoint_dir $checkpoint \
    --output_dir $output \
    --max_batch_size 64 \
    --max_input_len 1024 \
    --max_output_len 512 \
    --tp_size $tp \
    --pp_size $pp \
    --gpt_attention_plugin $dtype \
    --gemm_plugin $dtype
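Note that tp × pp must equal the number of MPI ranks the server is launched with (8 below). For the failing pure-pipeline case mentioned in the additional notes, the same two commands would presumably be rerun with different parallelism settings; a hedged sketch:

```bash
# Assumed pp=8 variant: tp * pp must still match --world_size (8).
tp=1
pp=8
python3 convert_checkpoint.py --model_dir $model --output_dir $checkpoint \
    --tp_size $tp --pp_size $pp --dtype $dtype
trtllm-build --checkpoint_dir $checkpoint --output_dir $output \
    --max_batch_size 64 --max_input_len 1024 --max_output_len 512 \
    --tp_size $tp --pp_size $pp \
    --gpt_attention_plugin $dtype --gemm_plugin $dtype
```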


3. Run the Triton docker container:

image=nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
name=triton-pp8

nvidia-docker run -e DISPLAY=unix: -it --net=host --ulimit core=-1 --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --detach-keys=ctrl-i,c \
    --shm-size='10g' --ipc=host \
    -v /home/jgq:/home/jgq \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /usr/bin/docker:/usr/bin/docker \
    -v /tmp/.X11-unix/:/tmp/.X11-unix/ \
    -w /home/jgq \
    --privileged -v /etc/timezone:/etc/timezone:ro \
    --name $name $image /bin/bash
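A quick sanity check (not part of the original report) is to confirm the container sees all eight GPUs:

```bash
# Lists the GPUs visible inside the running container.
docker exec $name nvidia-smi -L
```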


4. Inside the Triton container, launch the Triton server:

export NCCL_DEBUG=INFO
python3 /home/jgq/tensorrtllm_backend/scripts/launch_triton_server.py \
    --model_repo /home/jgq/tensorrtllm_backend/all_models/inflight_batcher_llm_pp8 \
    --world_size 8
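Assuming trtllm-build's default rank*.engine naming, it may also be worth verifying before launch that the engine directory referenced by the model repository holds one engine per rank:

```bash
# Expect 8 engine files (rank0.engine ... rank7.engine) for --world_size 8.
ls $output/rank*.engine | wc -l
```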


The backend pipeline is attached:
[backend.tar.gz](https://github.com/NVIDIA/TensorRT-LLM/files/15440662/backend.tar.gz)

5. Triton reports:

I0524 22:36:07.146691 392 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f5c86000000' with size 268435456 I0524 22:36:07.174382 397 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb464000000' with size 268435456 I0524 22:36:07.183317 390 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f69be000000' with size 268435456 I0524 22:36:07.184496 393 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb2f4000000' with size 268435456 I0524 22:36:07.185052 394 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f2396000000' with size 268435456 I0524 22:36:07.185193 395 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f3e78000000' with size 268435456 I0524 22:36:07.185852 396 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f25f8000000' with size 268435456 I0524 22:36:07.186180 391 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fc508000000' with size 268435456 I0524 22:36:07.243768 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.243785 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.243790 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.243794 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.243799 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.243803 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.243807 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.243811 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.244352 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.244376 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.244381 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.244386 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.244391 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.244395 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.244399 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.244402 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.246620 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.246636 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.246641 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.246645 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.246650 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.246654 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.246658 393 cuda_memory_manager.cc:107] CUDA 
memory pool is created on device 6 with size 67108864 I0524 22:36:07.246662 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.246708 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.246725 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.246730 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.246734 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.246738 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.246742 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.246746 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.246749 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.246839 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.246853 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.246858 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.246862 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.246866 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.246870 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.246874 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.246879 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.247296 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.247316 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.247320 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.247324 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.247329 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.247332 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.247336 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.247340 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.247561 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.247577 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.247581 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.247585 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.247590 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.247593 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 
67108864 I0524 22:36:07.247597 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.247601 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.255679 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.255699 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.255703 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.255708 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.255713 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.255717 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.255720 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.255724 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 W0524 22:36:13.937370 397 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:13.951333 397 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:13.987314 393 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:13.989355 393 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.159477 390 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.164367 390 model_lifecycle.cc:469] loading: postprocessing:1 I0524 22:36:14.164431 390 model_lifecycle.cc:469] loading: preprocessing:1 I0524 22:36:14.164514 390 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.216422 391 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.218403 391 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.258596 395 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.260504 395 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.262737 392 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.264624 392 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.268510 394 server.cc:251] failed to enable peer access for some device pairs W0524 22:36:14.268637 396 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.270619 396 model_lifecycle.cc:469] loading: tensorrt_llm:1 I0524 22:36:14.270685 394 model_lifecycle.cc:469] loading: tensorrt_llm:1 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be 
automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. 
max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. 
Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. 
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 4 [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. 
[TensorRT-LLM][INFO] MPI size: 8, rank: 1 [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 3 [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 5 [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 2 [TensorRT-LLM][INFO] MPI size: 8, rank: 6 [TensorRT-LLM][INFO] MPI size: 8, rank: 7 [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 0 I0524 22:36:16.634983 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0) I0524 22:36:16.635013 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
I0524 22:36:20.245422 390 model_lifecycle.cc:835] successfully loaded 'postprocessing' Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. I0524 22:36:20.252526 390 model_lifecycle.cc:835] successfully loaded 'preprocessing' [TensorRT-LLM][INFO] Rank 4 is using GPU 4 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 4 peer access Device 0 is not available. [TensorRT-LLM][WARNING] Device 4 peer access Device 1 is not available. [TensorRT-LLM][WARNING] Device 4 peer access Device 2 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 2 is using GPU 2 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 2 peer access Device 4 is not available. [TensorRT-LLM][WARNING] Device 2 peer access Device 6 is not available. [TensorRT-LLM][WARNING] Device 2 peer access Device 7 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 3 is using GPU 3 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 3 peer access Device 5 is not available. [TensorRT-LLM][WARNING] Device 3 peer access Device 6 is not available. [TensorRT-LLM][WARNING] Device 3 peer access Device 7 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 5 is using GPU 5 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 5 peer access Device 0 is not available. [TensorRT-LLM][WARNING] Device 5 peer access Device 1 is not available. [TensorRT-LLM][WARNING] Device 5 peer access Device 3 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 1 is using GPU 1 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 1 peer access Device 4 is not available. [TensorRT-LLM][WARNING] Device 1 peer access Device 5 is not available. [TensorRT-LLM][WARNING] Device 1 peer access Device 7 is not available. 
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 6 is using GPU 6 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 6 peer access Device 0 is not available. [TensorRT-LLM][WARNING] Device 6 peer access Device 2 is not available. [TensorRT-LLM][WARNING] Device 6 peer access Device 3 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19320 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19330 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19322 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19332 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19324 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19334 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19326 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19336 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19328 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19338 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19330 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19340 (MiB) [TensorRT-LLM][INFO] Rank 7 is using GPU 7 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 7 peer access Device 1 is not available. [TensorRT-LLM][WARNING] Device 7 peer access Device 2 is not available. [TensorRT-LLM][WARNING] Device 7 peer access Device 3 is not available. [TensorRT-LLM][INFO] Loaded engine size: 18325 MiB [TensorRT-LLM][INFO] Rank 0 is using GPU 0 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available. [TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available. [TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available. [TensorRT-LLM][INFO] Loaded engine size: 18325 MiB [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 18432, GPU 21336 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 18434, GPU 21346 (MiB) VM-0-16-ubuntu:390:480 [0] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0> VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. 
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) VM-0-16-ubuntu:390:480 [0] NCCL INFO cudaDriverVersion 12010 NCCL version 2.19.4+cuda12.3 VM-0-16-ubuntu:391:485 [1] NCCL INFO cudaDriverVersion 12010 VM-0-16-ubuntu:391:485 [1] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0> VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) VM-0-16-ubuntu:391:485 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P plugin IBext VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0> VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0 VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket VM-0-16-ubuntu:390:480 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P plugin IBext VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found. VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found. 
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0> VM-0-16-ubuntu:390:480 [0] NCCL INFO Using non-device net plugin version 0 VM-0-16-ubuntu:390:480 [0] NCCL INFO Using network Socket VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init START VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init START VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/02 : 0 1 VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/02 : 0 1 VM-0-16-ubuntu:390:480 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P Chunksize set to 524288 VM-0-16-ubuntu:391:485 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P Chunksize set to 524288 VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all rings VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all trees VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all rings VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all trees VM-0-16-ubuntu:391:485 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 VM-0-16-ubuntu:391:485 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer VM-0-16-ubuntu:390:480 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 VM-0-16-ubuntu:390:480 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init COMPLETE VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init COMPLETE [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +18324, now: CPU 0, GPU 18324 (MiB) NCCL version 2.19.4+cuda12.3 VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0 VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket VM-0-16-ubuntu:392:494 [2] NCCL INFO cudaDriverVersion 12010 VM-0-16-ubuntu:392:494 [2] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0> VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)

VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'

VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'

[... the two NCCL warnings above repeat many more times ...]

VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1364 -> 1
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1635 -> 1
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1673 -> 1
Failed, NCCL error /tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/recvPlugin.cpp:132 'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in cublasCreate(handle.get()): CUBLAS_STATUS_ALLOC_FAILED (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:190)
1 0x7fb30cb0ef12 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.9(+0x57f12) [0x7fb30cb0ef12]
2 0x7fb30cc4f693 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.9(+0x198693) [0x7fb30cc4f693]
3 0x7fb30cc1d049 tensorrt_llm::plugins::GemmPlugin::init() + 41
4 0x7fb30cc1db9a tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr const&) + 298
5 0x7fb30cc1dcef tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
6 0x7fb2c86f1506 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d8506) [0x7fb2c86f1506]
7 0x7fb2c86fe0ae /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10e50ae) [0x7fb2c86fe0ae]
8 0x7fb2c8686e17 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106de17) [0x7fb2c8686e17]
9 0x7fb2c8684d9e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106bd9e) [0x7fb2c8684d9e]
10 0x7fb2c869cc8b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1083c8b) [0x7fb2c869cc8b]
11 0x7fb2c869ff12 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1086f12) [0x7fb2c869ff12]
12 0x7fb2c86a02ec /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10872ec) [0x7fb2c86a02ec]
13 0x7fb2c86d39b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ba9b1) [0x7fb2c86d39b1]
14 0x7fb2c86d4777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10bb777) [0x7fb2c86d4777]
15 0x7fb3887b6f52 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, nvinfer1::ILogger&) + 482
16 0x7fb38884e6b6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1222
17 0x7fb38880cd5a tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1930
18 0x7fb388804170 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::__cxx11::list<std::shared_ptr, std::allocator<std::shared_ptr > > (int)>, std::function<void (unsigned long, std::__cxx11::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator > const&, bool, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)>, std::function<std::unordered_set<unsigned long, std::hash, std::equal_to, std::allocator > ()>, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 336
19 0x7fb4a0108075 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, ompi_communicator_t*) + 4901
20 0x7fb4a0109019 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 73
21 0x7fb4a014741c TRITONBACKEND_ModelInstanceInitialize + 828
22 0x7fb4ae124086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7fb4ae124086]
23 0x7fb4ae1252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7fb4ae1252c6]
24 0x7fb4ae1078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7fb4ae1078d5]
25 0x7fb4ae107f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7fb4ae107f16]
26 0x7fb4ae11480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7fb4ae11480d]
27 0x7fb4ad776ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4ad776ee8]
28 0x7fb4ae0fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7fb4ae0fe64b]
29 0x7fb4ae10f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7fb4ae10f4f5]
30 0x7fb4ae113c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7fb4ae113c2e]
31 0x7fb4ae208318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7fb4ae208318]
32 0x7fb4ae20bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7fb4ae20bbfc]
33 0x7fb4ae367a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7fb4ae367a02]
34 0x7fb4ad9e2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4ad9e2253]
35 0x7fb4ad771ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4ad771ac3]
36 0x7fb4ad803850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fb4ad803850]
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 18432, GPU 21472 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 18433, GPU 21482 (MiB)



### Expected behavior

The Triton server loads successfully.

### Actual behavior

The Triton server fails during model loading.

### Additional notes

I tried tp=2, pp=4 and that works, but it fails to load with pp=8.

byshiue commented 1 month ago

It looks like the program really is running out of memory: with pipeline parallelism, the first and last GPUs often require more memory. Could you try a smaller batch size, input length, or output length, or try GPUs with more memory?
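A concrete, hypothetical way to act on that suggestion is to rebuild the engine with smaller limits (the values here are illustrative, not prescribed):

```bash
# Illustrative rebuild with a smaller activation/KV footprint.
trtllm-build --checkpoint_dir $checkpoint --output_dir $output \
    --max_batch_size 8 --max_input_len 512 --max_output_len 256 \
    --tp_size $tp --pp_size $pp \
    --gpt_attention_plugin $dtype --gemm_plugin $dtype
```

Lowering kv_cache_free_gpu_mem_fraction in the tensorrt_llm model's config.pbtxt is another commonly used lever, assuming the attached backend configuration exposes that parameter.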