TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
The backend pipeline is attached:
[backend.tar.gz](https://github.com/NVIDIA/TensorRT-LLM/files/15440662/backend.tar.gz)
5. Triton reports:
I0524 22:36:07.146691 392 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f5c86000000' with size 268435456
I0524 22:36:07.174382 397 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb464000000' with size 268435456
I0524 22:36:07.183317 390 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f69be000000' with size 268435456
I0524 22:36:07.184496 393 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb2f4000000' with size 268435456
I0524 22:36:07.185052 394 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f2396000000' with size 268435456
I0524 22:36:07.185193 395 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f3e78000000' with size 268435456
I0524 22:36:07.185852 396 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f25f8000000' with size 268435456
I0524 22:36:07.186180 391 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fc508000000' with size 268435456
I0524 22:36:07.243768 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.243785 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.243790 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.243794 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.243799 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.243803 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.243807 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.243811 392 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0524 22:36:07.244352 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.244376 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.244381 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.244386 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.244391 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.244395 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.244399 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.244402 397 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0524 22:36:07.246620 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.246636 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.246641 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.246645 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.246650 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.246654 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.246658 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.246662 393 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0524 22:36:07.246708 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.246725 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.246730 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.246734 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.246738 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.246742 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.246746 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.246749 394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0524 22:36:07.246839 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.246853 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.246858 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.246862 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.246866 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.246870 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.246874 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.246879 395 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0524 22:36:07.247296 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.247316 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.247320 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.247324 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.247329 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.247332 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.247336 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.247340 396 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0524 22:36:07.247561 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.247577 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.247581 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.247585 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.247590 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.247593 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.247597 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.247601 391 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0524 22:36:07.255679 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0524 22:36:07.255699 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0524 22:36:07.255703 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0524 22:36:07.255708 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0524 22:36:07.255713 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0524 22:36:07.255717 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0524 22:36:07.255720 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0524 22:36:07.255724 390 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
W0524 22:36:13.937370 397 server.cc:251] failed to enable peer access for some device pairs
I0524 22:36:13.951333 397 model_lifecycle.cc:469] loading: tensorrt_llm:1
W0524 22:36:13.987314 393 server.cc:251] failed to enable peer access for some device pairs
I0524 22:36:13.989355 393 model_lifecycle.cc:469] loading: tensorrt_llm:1
W0524 22:36:14.159477 390 server.cc:251] failed to enable peer access for some device pairs
I0524 22:36:14.164367 390 model_lifecycle.cc:469] loading: postprocessing:1
I0524 22:36:14.164431 390 model_lifecycle.cc:469] loading: preprocessing:1
I0524 22:36:14.164514 390 model_lifecycle.cc:469] loading: tensorrt_llm:1
W0524 22:36:14.216422 391 server.cc:251] failed to enable peer access for some device pairs
I0524 22:36:14.218403 391 model_lifecycle.cc:469] loading: tensorrt_llm:1
W0524 22:36:14.258596 395 server.cc:251] failed to enable peer access for some device pairs
I0524 22:36:14.260504 395 model_lifecycle.cc:469] loading: tensorrt_llm:1
W0524 22:36:14.262737 392 server.cc:251] failed to enable peer access for some device pairs
I0524 22:36:14.264624 392 model_lifecycle.cc:469] loading: tensorrt_llm:1
W0524 22:36:14.268510 394 server.cc:251] failed to enable peer access for some device pairs
W0524 22:36:14.268637 396 server.cc:251] failed to enable peer access for some device pairs
I0524 22:36:14.270619 396 model_lifecycle.cc:469] loading: tensorrt_llm:1
I0524 22:36:14.270685 394 model_lifecycle.cc:469] loading: tensorrt_llm:1
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 4
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 1
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 3
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 5
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 2
[TensorRT-LLM][INFO] MPI size: 8, rank: 6
[TensorRT-LLM][INFO] MPI size: 8, rank: 7
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 0
I0524 22:36:16.634983 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0524 22:36:16.635013 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I0524 22:36:20.245422 390 model_lifecycle.cc:835] successfully loaded 'postprocessing'
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I0524 22:36:20.252526 390 model_lifecycle.cc:835] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] Rank 4 is using GPU 4
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 4 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 4 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 4 peer access Device 2 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 2 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 3 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 5 is using GPU 5
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 5 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 5 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 5 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 1 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 6 is using GPU 6
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 6 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 6 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 6 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19320 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19330 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19322 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19324 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19334 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19326 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19336 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19338 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19330 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19340 (MiB)
[TensorRT-LLM][INFO] Rank 7 is using GPU 7
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 7 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 7 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 7 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 18325 MiB
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 18325 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 18432, GPU 21336 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 18434, GPU 21346 (MiB)
VM-0-16-ubuntu:390:480 [0] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0>
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
VM-0-16-ubuntu:390:480 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.19.4+cuda12.3
VM-0-16-ubuntu:391:485 [1] NCCL INFO cudaDriverVersion 12010
VM-0-16-ubuntu:391:485 [1] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0>
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
VM-0-16-ubuntu:391:485 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P plugin IBext
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0>
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket
VM-0-16-ubuntu:390:480 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P plugin IBext
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0>
VM-0-16-ubuntu:390:480 [0] NCCL INFO Using non-device net plugin version 0
VM-0-16-ubuntu:390:480 [0] NCCL INFO Using network Socket
VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init START
VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init START
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/02 : 0 1
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/02 : 0 1
VM-0-16-ubuntu:390:480 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P Chunksize set to 524288
VM-0-16-ubuntu:391:485 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P Chunksize set to 524288
VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all rings
VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all trees
VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all rings
VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all trees
VM-0-16-ubuntu:391:485 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
VM-0-16-ubuntu:391:485 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
VM-0-16-ubuntu:390:480 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
VM-0-16-ubuntu:390:480 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init COMPLETE
VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init COMPLETE
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +18324, now: CPU 0, GPU 18324 (MiB)
NCCL version 2.19.4+cuda12.3
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket
VM-0-16-ubuntu:392:494 [2] NCCL INFO cudaDriverVersion 12010
VM-0-16-ubuntu:392:494 [2] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0>
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
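The log above is long and repetitive, so a small script can condense it before reading further. The sketch below is only an illustration: the function name `summarize_log` and the two regular expressions are assumptions made for this example, written against the line shapes visible in this log (`Loaded engine size: N MiB` and `[dev] file:line NCCL WARN message`). It counts how often each engine size and each NCCL warning appears.

```python
import re
from collections import Counter

# Patterns written against the line shapes seen in this issue's log (assumed, not an official format).
ENGINE_SIZE = re.compile(r"Loaded engine size: (\d+) MiB")
NCCL_WARN = re.compile(r"\[(\d+)\] \S+ NCCL WARN (.+)$")


def summarize_log(lines):
    """Count engine sizes and NCCL warnings, the latter keyed by (device, message)."""
    engine_sizes = Counter()
    warnings = Counter()
    for line in lines:
        match = ENGINE_SIZE.search(line)
        if match:
            engine_sizes[int(match.group(1))] += 1
            continue
        match = NCCL_WARN.search(line.rstrip())
        if match:
            warnings[(int(match.group(1)), match.group(2))] += 1
    return engine_sizes, warnings


# A few lines copied from the log above.
sample = [
    "[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB",
    "[TensorRT-LLM][INFO] Loaded engine size: 18325 MiB",
    "VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'",
    "VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'",
    "VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'",
]

engine_sizes, warnings = summarize_log(sample)
for size, count in sorted(engine_sizes.items()):
    print(f"engine size {size} MiB: {count} occurrence(s)")
for (device, message), count in warnings.most_common():
    print(f"device {device}: {message} x{count}")
```

Run against the full log, this would show which device reports the `out of memory` failures and which engine sizes are larger than the rest.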
### Expected behavior
The Triton server loads the model successfully.
### Actual behavior
The Triton server fails while loading the model.
### Additional notes
I tried tp=2 pp=4 and this works, but the server fails to load when pp=8.
It looks like the program really is running out of memory: with pipeline parallelism (PP), the first and last GPUs often require more memory than the others.
Could you try a smaller batch size, input length, or output length, or a GPU with more memory?
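The point about the first and last pipeline stages can be checked roughly against the log. Most engines are reported as 16321 MiB and two as 18325 MiB. The sketch below assumes the published Llama-3-70B dimensions (vocabulary size 128256, hidden size 8192), which are not stated in this issue, and assumes the extra weight on the edge stages is one unsharded fp16 vocabulary matrix (the embedding or the LM head). This is one plausible reading, not a confirmed breakdown of the engine.

```python
MIB = 1024 ** 2

# Assumed Llama-3-70B dimensions; these values are not taken from the issue.
VOCAB_SIZE = 128256
HIDDEN_SIZE = 8192
BYTES_PER_FP16 = 2

# Engine sizes reported by "Loaded engine size" lines in the log.
MIDDLE_STAGE_MIB = 16321
EDGE_STAGE_MIB = 18325


def matrix_mib(rows: int, cols: int, bytes_per_element: int) -> float:
    """Size of a dense rows x cols matrix in MiB."""
    return rows * cols * bytes_per_element / MIB


vocab_matrix_mib = matrix_mib(VOCAB_SIZE, HIDDEN_SIZE, BYTES_PER_FP16)
observed_gap_mib = EDGE_STAGE_MIB - MIDDLE_STAGE_MIB

print(f"one fp16 vocab x hidden matrix: {vocab_matrix_mib:.0f} MiB")
print(f"gap between edge and middle engines: {observed_gap_mib} MiB")
```

Under these assumptions both numbers come out to 2004 MiB, which is consistent with the edge stages holding one extra vocabulary-sized matrix and therefore having less headroom for the KV cache and activations.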
### System Info
- CPU: x86
- Driver:
### Who can help?
@byshiue
### Information
### Tasks
`examples` folder (such as GLUE/SQuAD, ...)
### Reproduction
Check out tensorrt-llm and the backend at v0.9.0, and follow https://developer.nvidia.com/zh-cn/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/

Build the llama-3 engine in nvidia/cuda:12.1.0-devel-ubuntu22.04:

    model=/home/jgq/cloud/models/llama3-70b-instruct
    checkpoint=/home/jgq/cloud/engines/llama3-70b-instruct/cvt-tp
    output=/home/jgq/cloud/engines/llama3-70b-instruct/engines-tp
    rm -rf $checkpoint && mkdir $checkpoint
    rm -rf $output && mkdir $output
    tp=2
    pp=4
    dtype=float16

    cd $trtllm/examples/llama
    python3 convert_checkpoint.py --model_dir $model \
        --output_dir $checkpoint \
        --tp_size $tp \
        --pp_size $pp \
        --dtype $dtype

    trtllm-build --checkpoint_dir $checkpoint \
        --output_dir $output \
        --max_batch_size 64 \
        --max_input_len 1024 \
        --max_output_len 512 \
        --tp_size $tp \
        --pp_size $pp \
        --gpt_attention_plugin $dtype \
        --gemm_plugin $dtype

    image=nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    name=triton-pp8

    nvidia-docker run -e DISPLAY=unix: -it --net=host --ulimit core=-1 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --detach-keys=ctrl-i,c \
        --shm-size='10g' --ipc=host \
        -v /home/jgq:/home/jgq \
        -v /var/run/docker.sock:/var/run/docker.sock \
        -v /usr/bin/docker:/usr/bin/docker \
        -v /tmp/.X11-unix/:/tmp/.X11-unix/ \
        -w /home/jgq \
        --privileged -v /etc/timezone:/etc/timezone:ro \
        --name $name $image /bin/bash

    export NCCL_DEBUG=INFO
    python3 /home/jgq/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo /home/jgq/tensorrtllm_backend/all_models/inflight_batcher_llm_pp8 --world_size 8
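The build step uses `tp=2` and `pp=4`, and the launch passes `--world_size 8`. TensorRT-LLM expects one rank per (tensor-parallel, pipeline-parallel) position, so the world size should equal `tp_size * pp_size`. The helper below is a hypothetical pre-flight check written for this example; `check_world_size` is not part of TensorRT-LLM or the Triton backend.

```python
def check_world_size(tp_size: int, pp_size: int, world_size: int) -> int:
    """Return world_size if it matches tp_size * pp_size, otherwise raise ValueError.

    Hypothetical helper for illustration; it only encodes the rule
    world_size == tp_size * pp_size.
    """
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive")
    expected = tp_size * pp_size
    if world_size != expected:
        raise ValueError(
            f"world_size={world_size} does not match "
            f"tp_size * pp_size = {tp_size} * {pp_size} = {expected}"
        )
    return world_size


# Values from the reproduction steps: tp=2, pp=4, --world_size 8.
check_world_size(tp_size=2, pp_size=4, world_size=8)

# Moving to pp=8 while keeping tp=2 would need 16 ranks, so 8 is rejected.
try:
    check_world_size(tp_size=2, pp_size=8, world_size=8)
except ValueError as error:
    print(error)
```

Running a check like this before launching makes it explicit which tp/pp combination the engine directory was built for when switching between the pp=4 and pp=8 configurations.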
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 4
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 1
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 3
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 5
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 2
[TensorRT-LLM][INFO] MPI size: 8, rank: 6
[TensorRT-LLM][INFO] MPI size: 8, rank: 7
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 8, rank: 0
I0524 22:36:16.634983 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0524 22:36:16.635013 390 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I0524 22:36:20.245422 390 model_lifecycle.cc:835] successfully loaded 'postprocessing'
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I0524 22:36:20.252526 390 model_lifecycle.cc:835] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] Rank 4 is using GPU 4
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 4 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 4 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 4 peer access Device 2 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 2 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 3 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 5 is using GPU 5
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 5 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 5 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 5 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 1 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] Rank 6 is using GPU 6
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 6 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 6 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 6 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 16321 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19320 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19330 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19322 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19324 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19334 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19326 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19336 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19338 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16428, GPU 19330 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16429, GPU 19340 (MiB)
[TensorRT-LLM][INFO] Rank 7 is using GPU 7
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 7 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 7 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 7 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 18325 MiB
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 18325 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 18432, GPU 21336 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 18434, GPU 21346 (MiB)
VM-0-16-ubuntu:390:480 [0] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0>
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
VM-0-16-ubuntu:390:480 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.19.4+cuda12.3
VM-0-16-ubuntu:391:485 [1] NCCL INFO cudaDriverVersion 12010
VM-0-16-ubuntu:391:485 [1] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0>
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
VM-0-16-ubuntu:391:485 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P plugin IBext
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:391:485 [1] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0>
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket
VM-0-16-ubuntu:390:480 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P plugin IBext
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/IB : No device found.
VM-0-16-ubuntu:390:480 [0] NCCL INFO NET/Socket : Using [0]eth0:10.9.0.16<0> [1]vethd658a63:fe80::e815:68ff:fe18:955f%vethd658a63<0> [2]vethb7a0656:fe80::70a3:4fff:fe8f:3a93%vethb7a0656<0> [3]veth3df093d:fe80::842b:55ff:fe4d:6af3%veth3df093d<0> [4]vethfd06bad:fe80::f4dd:16ff:fe63:4b14%vethfd06bad<0> [5]vethb504cd9:fe80::4c3:fff:fee5:5cfa%vethb504cd9<0> [6]veth7897b06:fe80::38c6:a1ff:fe8a:2b4a%veth7897b06<0>
VM-0-16-ubuntu:390:480 [0] NCCL INFO Using non-device net plugin version 0
VM-0-16-ubuntu:390:480 [0] NCCL INFO Using network Socket
VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init START
VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init START
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/02 : 0 1
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/02 : 0 1
VM-0-16-ubuntu:390:480 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
VM-0-16-ubuntu:390:480 [0] NCCL INFO P2P Chunksize set to 524288
VM-0-16-ubuntu:391:485 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
VM-0-16-ubuntu:391:485 [1] NCCL INFO P2P Chunksize set to 524288
VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
VM-0-16-ubuntu:391:485 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
VM-0-16-ubuntu:390:480 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all rings
VM-0-16-ubuntu:390:480 [0] NCCL INFO Connected all trees
VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all rings
VM-0-16-ubuntu:391:485 [1] NCCL INFO Connected all trees
VM-0-16-ubuntu:391:485 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
VM-0-16-ubuntu:391:485 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
VM-0-16-ubuntu:390:480 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
VM-0-16-ubuntu:390:480 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
VM-0-16-ubuntu:391:485 [1] NCCL INFO comm 0x7fc43588ac80 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0x66b4a689c4af29a4 - Init COMPLETE
VM-0-16-ubuntu:390:480 [0] NCCL INFO comm 0x7f68e5d736f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0x66b4a689c4af29a4 - Init COMPLETE
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +18324, now: CPU 0, GPU 18324 (MiB)
NCCL version 2.19.4+cuda12.3
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using non-device net plugin version 0
VM-0-16-ubuntu:391:485 [1] NCCL INFO Using network Socket
VM-0-16-ubuntu:392:494 [2] NCCL INFO cudaDriverVersion 12010
VM-0-16-ubuntu:392:494 [2] NCCL INFO Bootstrap : Using eth0:10.9.0.16<0>
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
VM-0-16-ubuntu:392:494 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:43 NCCL WARN Cuda failure 'out of memory'
VM-0-16-ubuntu:392:494 [2] enqueue.cc:54 NCCL WARN Cuda failure 'invalid resource handle'
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1364 -> 1
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1635 -> 1
VM-0-16-ubuntu:392:494 [2] NCCL INFO init.cc:1673 -> 1
Failed, NCCL error /tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/recvPlugin.cpp:132 'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in cublasCreate(handle.get()): CUBLAS_STATUS_ALLOC_FAILED (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:190)
1 0x7fb30cb0ef12 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.9(+0x57f12) [0x7fb30cb0ef12]
2 0x7fb30cc4f693 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.9(+0x198693) [0x7fb30cc4f693]
3 0x7fb30cc1d049 tensorrt_llm::plugins::GemmPlugin::init() + 41
4 0x7fb30cc1db9a tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const, unsigned long, std::shared_ptr const&) + 298
5 0x7fb30cc1dcef tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const , void const, unsigned long) + 191
6 0x7fb2c86f1506 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d8506) [0x7fb2c86f1506]
7 0x7fb2c86fe0ae /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10e50ae) [0x7fb2c86fe0ae]
8 0x7fb2c8686e17 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106de17) [0x7fb2c8686e17]
9 0x7fb2c8684d9e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106bd9e) [0x7fb2c8684d9e]
10 0x7fb2c869cc8b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1083c8b) [0x7fb2c869cc8b]
11 0x7fb2c869ff12 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1086f12) [0x7fb2c869ff12]
12 0x7fb2c86a02ec /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10872ec) [0x7fb2c86a02ec]
13 0x7fb2c86d39b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ba9b1) [0x7fb2c86d39b1]
14 0x7fb2c86d4777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10bb777) [0x7fb2c86d4777]
15 0x7fb3887b6f52 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const, unsigned long, nvinfer1::ILogger&) + 482
16 0x7fb38884e6b6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1222
17 0x7fb38880cd5a tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1930
18 0x7fb388804170 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::__cxx11::list<std::shared_ptr, std::allocator<std::shared_ptr > > (int)>, std::function<void (unsigned long, std::__cxx11::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator > const&, bool, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)>, std::function<std::unordered_set<unsigned long, std::hash, std::equal_to, std::allocator > ()>, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 336
19 0x7fb4a0108075 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState, TRITONBACKEND_ModelInstance, ompi_communicator_t) + 4901
20 0x7fb4a0109019 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 73
21 0x7fb4a014741c TRITONBACKEND_ModelInstanceInitialize + 828
22 0x7fb4ae124086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7fb4ae124086]
23 0x7fb4ae1252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7fb4ae1252c6]
24 0x7fb4ae1078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7fb4ae1078d5]
25 0x7fb4ae107f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7fb4ae107f16]
26 0x7fb4ae11480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7fb4ae11480d]
27 0x7fb4ad776ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4ad776ee8]
28 0x7fb4ae0fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7fb4ae0fe64b]
29 0x7fb4ae10f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7fb4ae10f4f5]
30 0x7fb4ae113c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7fb4ae113c2e]
31 0x7fb4ae208318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7fb4ae208318]
32 0x7fb4ae20bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7fb4ae20bbfc]
33 0x7fb4ae367a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7fb4ae367a02]
34 0x7fb4ad9e2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4ad9e2253]
35 0x7fb4ad771ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4ad771ac3]
36 0x7fb4ad803850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fb4ad803850]
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 18432, GPU 21472 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 18433, GPU 21482 (MiB)