NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

NCCL reports 'out of memory' when deploying llama3 to Triton on 8xV100 #1670

forrestjgq commented 1 month ago

### System Info

- CPU: x86
- Driver:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB            On | 00000000:00:08.0 Off |                    0 |
| N/A   44C    P0               57W / 300W|  32498MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB            On | 00000000:00:09.0 Off |                    0 |
| N/A   43C    P0               58W / 300W|  19530MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB            On | 00000000:00:0A.0 Off |                    0 |
| N/A   42C    P0               57W / 300W|  19474MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB            On | 00000000:00:0B.0 Off |                    0 |
| N/A   44C    P0               57W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB            On | 00000000:00:0C.0 Off |                    0 |
| N/A   43C    P0               59W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB            On | 00000000:00:0D.0 Off |                    0 |
| N/A   42C    P0               58W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB            On | 00000000:00:0E.0 Off |                    0 |
| N/A   44C    P0               61W / 300W|  19476MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB            On | 00000000:00:0F.0 Off |                    0 |
| N/A   45C    P0               62W / 300W|  21482MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
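To track headroom while reproducing, per-GPU memory can be polled continuously; a minimal sketch using standard nvidia-smi query flags (not part of the original report):

```bash
# Poll per-GPU memory once per second while the server loads.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```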

### Who can help?

@byshiue

### Reproduction

1. Check out TensorRT-LLM and tensorrtllm_backend at v0.9.0, and follow https://developer.nvidia.com/zh-cn/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/
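For reference, a minimal sketch of this step (repository URLs are the public ones; the local layout is an assumption):

```bash
# Hypothetical checkout of the matching v0.9.0 tags.
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
trtllm=$PWD/TensorRT-LLM   # $trtllm as used in the build commands below
```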

2. Build the Llama 3 engine in an nvidia/cuda:12.1.0-devel-ubuntu22.04 container:

model=/home/jgq/cloud/models/llama3-70b-instruct
checkpoint=/home/jgq/cloud/engines/llama3-70b-instruct/cvt-tp
output=/home/jgq/cloud/engines/llama3-70b-instruct/engines-tp
rm -rf $checkpoint && mkdir $checkpoint
rm -rf $output && mkdir $output
tp=2
pp=4
dtype=float16

cd $trtllm/examples/llama
python3 convert_checkpoint.py --model_dir $model \
    --output_dir $checkpoint \
    --tp_size $tp \
    --pp_size $pp \
    --dtype $dtype

trtllm-build --checkpoint_dir $checkpoint \
    --output_dir $output \
    --max_batch_size 64 \
    --max_input_len 1024 \
    --max_output_len 512 \
    --tp_size $tp \
    --pp_size $pp \
    --gpt_attention_plugin $dtype \
    --gemm_plugin $dtype
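Note that tp × pp must equal the number of MPI ranks the server is launched with (8 below). For the failing pure-pipeline case mentioned in the additional notes, the same two commands would presumably be rerun with different parallelism settings; a hedged sketch:

```bash
# Assumed pp=8 variant: tp * pp must still match --world_size (8).
tp=1
pp=8
python3 convert_checkpoint.py --model_dir $model --output_dir $checkpoint \
    --tp_size $tp --pp_size $pp --dtype $dtype
trtllm-build --checkpoint_dir $checkpoint --output_dir $output \
    --max_batch_size 64 --max_input_len 1024 --max_output_len 512 \
    --tp_size $tp --pp_size $pp \
    --gpt_attention_plugin $dtype --gemm_plugin $dtype
```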


3. Run the Triton docker container:

image=nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
name=triton-pp8

nvidia-docker run -e DISPLAY=unix: -it --net=host --ulimit core=-1 --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --detach-keys=ctrl-i,c \
    --shm-size='10g' --ipc=host \
    -v /home/jgq:/home/jgq \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /usr/bin/docker:/usr/bin/docker \
    -v /tmp/.X11-unix/:/tmp/.X11-unix/ \
    -w /home/jgq \
    --privileged -v /etc/timezone:/etc/timezone:ro \
    --name $name $image /bin/bash
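A quick sanity check (not part of the original report) is to confirm the container sees all eight GPUs:

```bash
# Lists the GPUs visible inside the running container.
docker exec $name nvidia-smi -L
```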


4. Inside the Triton container, launch the Triton server:

export NCCL_DEBUG=INFO
python3 /home/jgq/tensorrtllm_backend/scripts/launch_triton_server.py \
    --model_repo /home/jgq/tensorrtllm_backend/all_models/inflight_batcher_llm_pp8 \
    --world_size 8
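Assuming trtllm-build's default rank*.engine naming, it may also be worth verifying before launch that the engine directory referenced by the model repository holds one engine per rank:

```bash
# Expect 8 engine files (rank0.engine ... rank7.engine) for --world_size 8.
ls $output/rank*.engine | wc -l
```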


The backend pipeline is attached:
[backend.tar.gz](https://github.com/NVIDIA/TensorRT-LLM/files/15440662/backend.tar.gz)

5. Triton reports:

I0524 22:36:07.146691 392 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f5c86000000' with size 268435456 I0524 22:36:07.174382 397 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb464000000' with size 268435456 I0524 22:36:07.183317 390 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f69be000000' with size 268435456 I0524 22:36:07.184496 393 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb2f4000000' with size 268435456 I0524 22:36:07.185052 394 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f2396000000' with size 268435456 I0524 22:36:07.185193 395 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f3e78000000' with size 268435456 I0524 22:36:07.185852 396 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f25f8000000' with size 268435456 I0524 22:36:07.186180 391 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fc508000000' with size 268435456 I0524 22:36:07.243768 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.243785 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.243790 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.243794 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.243799 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.243803 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.243807 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.243811 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.244352 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.244376 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.244381 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.244386 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.244391 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.244395 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.244399 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.244402 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.246620 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.246636 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.246641 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.246645 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.246650 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.246654 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.246658 393 cuda_memory_manager.cc:107] CUDA 
memory pool is created on device 6 with size 67108864 I0524 22:36:07.246662 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.246708 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.246725 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.246730 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.246734 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.246738 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.246742 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.246746 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.246749 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.246839 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.246853 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.246858 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.246862 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.246866 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.246870 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.246874 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.246879 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.247296 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.247316 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.247320 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.247324 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.247329 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.247332 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.247336 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.247340 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.247561 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.247577 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.247581 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.247585 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.247590 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.247593 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 
67108864 I0524 22:36:07.247597 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.247601 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 I0524 22:36:07.255679 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0524 22:36:07.255699 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0524 22:36:07.255703 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864 I0524 22:36:07.255708 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864 I0524 22:36:07.255713 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864 I0524 22:36:07.255717 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864 I0524 22:36:07.255720 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864 I0524 22:36:07.255724 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864 W0524 22:36:13.937370 397 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:13.951333 397 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:13.987314 393 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:13.989355 393 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.159477 390 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.164367 390 model_lifecycle.cc:469] loading: postprocessing:1 I0524 22:36:14.164431 390 model_lifecycle.cc:469] loading: preprocessing:1 I0524 22:36:14.164514 390 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.216422 391 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.218403 391 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.258596 395 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.260504 395 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.262737 392 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.264624 392 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0524 22:36:14.268510 394 server.cc:251] failed to enable peer access for some device pairs W0524 22:36:14.268637 396 server.cc:251] failed to enable peer access for some device pairs I0524 22:36:14.270619 396 model_lifecycle.cc:469] loading: tensorrt_llm:1 I0524 22:36:14.270685 394 model_lifecycle.cc:469] loading: tensorrt_llm:1 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be 
automatically set [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. 
max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. 
Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. 
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 4 [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. 
[TensorRT-LLM][INFO] MPI size: 8, rank: 1 [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 3 [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 5 [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 2 [TensorRT-LLM][INFO] MPI size: 8, rank: 6 [TensorRT-LLM][INFO] MPI size: 8, rank: 7 [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict) [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] MPI size: 8, rank: 0 I0524 22:36:16.634983 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0) I0524 22:36:16.635013 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
I0524 22:36:20.245422 390 model_lifecycle.cc:835] successfully loaded 'postprocessing' Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. I0524 22:36:20.252526 390 model_lifecycle.cc:835] successfully loaded 'preprocessing' [TensorRT-LLM][INFO] Rank 4 is using GPU 4 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 4 peer access Device 0 is not available. [TensorRT-LLM][WARNING] Device 4 peer access Device 1 is not available. [TensorRT-LLM][WARNING] Device 4 peer access Device 2 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 2 is using GPU 2 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 2 peer access Device 4 is not available. [TensorRT-LLM][WARNING] Device 2 peer access Device 6 is not available. [TensorRT-LLM][WARNING] Device 2 peer access Device 7 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 3 is using GPU 3 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 3 peer access Device 5 is not available. [TensorRT-LLM][WARNING] Device 3 peer access Device 6 is not available. [TensorRT-LLM][WARNING] Device 3 peer access Device 7 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 5 is using GPU 5 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 5 peer access Device 0 is not available. [TensorRT-LLM][WARNING] Device 5 peer access Device 1 is not available. [TensorRT-LLM][WARNING] Device 5 peer access Device 3 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 1 is using GPU 1 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 1 peer access Device 4 is not available. [TensorRT-LLM][WARNING] Device 1 peer access Device 5 is not available. [TensorRT-LLM][WARNING] Device 1 peer access Device 7 is not available. 
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] Rank 6 is using GPU 6 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 6 peer access Device 0 is not available. [TensorRT-LLM][WARNING] Device 6 peer access Device 2 is not available. [TensorRT-LLM][WARNING] Device 6 peer access Device 3 is not available. [TensorRT-LLM][INFO] Loaded engine size: 16321 MiB [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19320 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19330 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19322 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19332 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19324 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19334 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19326 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19336 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19328 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19338 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19330 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19340 (MiB) [TensorRT-LLM][INFO] Rank 7 is using GPU 7 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 7 peer access Device 1 is not available. [TensorRT-LLM][WARNING] Device 7 peer access Device 2 is not available. [TensorRT-LLM][WARNING] Device 7 peer access Device 3 is not available. [TensorRT-LLM][INFO] Loaded engine size: 18325 MiB [TensorRT-LLM][INFO] Rank 0 is using GPU 0 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available. [TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available. [TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available. [TensorRT-LLM][INFO] Loaded engine size: 18325 MiB [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 18432, GPU 21336 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 18434, GPU 21346 (MiB) VM-0-16-ubuntu:390:480 [0] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0> VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. 
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) VM-0-16-ubuntu:390:480 [0] NCCL INFO cudaDriverVersion 12010 NCCL version 2.19.4+cuda12.3 VM-0-16-ubuntu:391:485 [1] NCCL INFO cudaDriverVersion 12010 VM-0-16-ubuntu:391:485 [1] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0> VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) VM-0-16-ubuntu:391:485 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P plugin IBext VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found. VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0> VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0 VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket VM-0-16-ubuntu:390:480 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P plugin IBext VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found. VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found. 
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0> VM-0-16-ubuntu:390:480 [0] NCCL INFO Using non-device net plugin version 0 VM-0-16-ubuntu:390:480 [0] NCCL INFO Using network Socket VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init START VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init START VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/02 : 0 1 VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/02 : 0 1 VM-0-16-ubuntu:390:480 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P Chunksize set to 524288 VM-0-16-ubuntu:391:485 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P Chunksize set to 524288 VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all rings VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all trees VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all rings VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all trees VM-0-16-ubuntu:391:485 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 VM-0-16-ubuntu:391:485 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer VM-0-16-ubuntu:390:480 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 VM-0-16-ubuntu:390:480 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init COMPLETE VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init COMPLETE [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +18324, now: CPU 0, GPU 18324 (MiB) NCCL version 2.19.4+cuda12.3 VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0 VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket VM-0-16-ubuntu:392:494 [2] NCCL INFO cudaDriverVersion 12010 VM-0-16-ubuntu:392:494 [2] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0> VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)

VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'

VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'

[... the two NCCL warnings above repeat many more times ...]

VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1364 -> 1
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1635 -> 1
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1673 -> 1
Failed, NCCL error /tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/recvPlugin.cpp:132 'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in cublasCreate(handle.get()): CUBLAS_STATUS_ALLOC_FAILED (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:190)
1 0x7fb30cb0ef12 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.9(+0x57f12) [0x7fb30cb0ef12]
2 0x7fb30cc4f693 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.9(+0x198693) [0x7fb30cc4f693]
3 0x7fb30cc1d049 tensorrt_llm::plugins::GemmPlugin::init() + 41
4 0x7fb30cc1db9a tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr const&) + 298
5 0x7fb30cc1dcef tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
6 0x7fb2c86f1506 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d8506) [0x7fb2c86f1506]
7 0x7fb2c86fe0ae /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10e50ae) [0x7fb2c86fe0ae]
8 0x7fb2c8686e17 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106de17) [0x7fb2c8686e17]
9 0x7fb2c8684d9e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106bd9e) [0x7fb2c8684d9e]
10 0x7fb2c869cc8b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1083c8b) [0x7fb2c869cc8b]
11 0x7fb2c869ff12 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1086f12) [0x7fb2c869ff12]
12 0x7fb2c86a02ec /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10872ec) [0x7fb2c86a02ec]
13 0x7fb2c86d39b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ba9b1) [0x7fb2c86d39b1]
14 0x7fb2c86d4777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10bb777) [0x7fb2c86d4777]
15 0x7fb3887b6f52 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, nvinfer1::ILogger&) + 482
16 0x7fb38884e6b6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1222
17 0x7fb38880cd5a tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1930
18 0x7fb388804170 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::__cxx11::list<std::shared_ptr, std::allocator<std::shared_ptr > > (int)>, std::function<void (unsigned long, std::__cxx11::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator > const&, bool, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)>, std::function<std::unordered_set<unsigned long, std::hash, std::equal_to, std::allocator > ()>, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 336
19 0x7fb4a0108075 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, ompi_communicator_t*) + 4901
20 0x7fb4a0109019 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 73
21 0x7fb4a014741c TRITONBACKEND_ModelInstanceInitialize + 828
22 0x7fb4ae124086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7fb4ae124086]
23 0x7fb4ae1252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7fb4ae1252c6]
24 0x7fb4ae1078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7fb4ae1078d5]
25 0x7fb4ae107f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7fb4ae107f16]
26 0x7fb4ae11480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7fb4ae11480d]
27 0x7fb4ad776ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4ad776ee8]
28 0x7fb4ae0fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7fb4ae0fe64b]
29 0x7fb4ae10f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7fb4ae10f4f5]
30 0x7fb4ae113c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7fb4ae113c2e]
31 0x7fb4ae208318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7fb4ae208318]
32 0x7fb4ae20bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7fb4ae20bbfc]
33 0x7fb4ae367a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7fb4ae367a02]
34 0x7fb4ad9e2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4ad9e2253]
35 0x7fb4ad771ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4ad771ac3]
36 0x7fb4ad803850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fb4ad803850]
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 18432, GPU 21472 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 18433, GPU 21482 (MiB)



### Expected behavior

The Triton server loads successfully.

### Actual behavior

The Triton server fails during model loading.

### Additional notes

I tried tp=2, pp=4 and that works, but it fails to load with pp=8.

byshiue commented 1 month ago

It looks like the program really is running out of memory: with pipeline parallelism, the first and last GPUs often require more memory. Could you try a smaller batch size, input length, or output length, or try GPUs with more memory?
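A concrete, hypothetical way to act on that suggestion is to rebuild the engine with smaller limits (the values here are illustrative, not prescribed):

```bash
# Illustrative rebuild with a smaller activation/KV footprint.
trtllm-build --checkpoint_dir $checkpoint --output_dir $output \
    --max_batch_size 8 --max_input_len 512 --max_output_len 256 \
    --tp_size $tp --pp_size $pp \
    --gpt_attention_plugin $dtype --gemm_plugin $dtype
```

Lowering kv_cache_free_gpu_mem_fraction in the tensorrt_llm model's config.pbtxt is another commonly used lever, assuming the attached backend configuration exposes that parameter.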