NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

misc/strongstream.cc:343 NCCL WARN Cuda failure 'an illegal memory access was encountered' #448

Open BasicCoder opened 9 months ago

BasicCoder commented 9 months ago

I used TP=4 when deploying the 70B model. This NCCL error occurred on two of the pods, but it does not occur on every pod. The error appeared after TRT-LLM had been running normally for a period of time.

TRT-LLM version: commit 4de32a86ae92bc49a7ec17c00ec2f2d03663c198
Execution env: 4 x A100 80GB PCIe
Error log:

+ NCCL_DEBUG=INFO
+ CUDA_VISIBLE_DEVICES=0,1,2,3
+ python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/all_models/inflight_batcher_llm
I1121 06:29:44.018785 6859 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1121 06:29:44.018825 6859 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.15
I1121 06:29:44.018785 6860 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1121 06:29:44.018825 6860 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.15
I1121 06:29:44.018835 6860 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.15
I1121 06:29:44.018785 6862 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1121 06:29:44.018825 6862 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.15
I1121 06:29:44.018834 6862 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.15
I1121 06:29:44.018785 6861 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1121 06:29:44.018825 6861 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.15
I1121 06:29:44.018835 6861 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.15
I1121 06:29:44.018916 6859 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.15
I1121 06:29:44.706331 6862 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f7b48000000' with size 268435456
I1121 06:29:44.706610 6860 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f1a60000000' with size 268435456
I1121 06:29:44.706896 6861 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f536c000000' with size 268435456
I1121 06:29:44.707014 6859 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f2a04000000' with size 268435456
I1121 06:29:44.765352 6862 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1121 06:29:44.765364 6862 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1121 06:29:44.765367 6862 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1121 06:29:44.765370 6862 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1121 06:29:44.765635 6860 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1121 06:29:44.765646 6860 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1121 06:29:44.765650 6860 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1121 06:29:44.765653 6860 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1121 06:29:44.766333 6861 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1121 06:29:44.766348 6861 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1121 06:29:44.766351 6861 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1121 06:29:44.766354 6861 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1121 06:29:44.766613 6859 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1121 06:29:44.766624 6859 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1121 06:29:44.766628 6859 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1121 06:29:44.766631 6859 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1121 06:29:46.286285 6861 model_lifecycle.cc:462] loading: postprocessing:1
I1121 06:29:46.286294 6860 model_lifecycle.cc:462] loading: tensorrt_llm:1
I1121 06:29:46.286327 6860 model_lifecycle.cc:462] loading: postprocessing:1
I1121 06:29:46.286294 6862 model_lifecycle.cc:462] loading: tensorrt_llm:1
I1121 06:29:46.286327 6862 model_lifecycle.cc:462] loading: postprocessing:1
I1121 06:29:46.286298 6859 model_lifecycle.cc:462] loading: tensorrt_llm:1
I1121 06:29:46.286331 6859 model_lifecycle.cc:462] loading: postprocessing:1
I1121 06:29:46.286326 6861 model_lifecycle.cc:462] loading: tensorrt_llm:1
I1121 06:29:46.286352 6859 model_lifecycle.cc:462] loading: preprocessing:1
I1121 06:29:46.286422 6861 model_lifecycle.cc:462] loading: preprocessing:1
I1121 06:29:46.286492 6860 model_lifecycle.cc:462] loading: preprocessing:1
I1121 06:29:46.286500 6862 model_lifecycle.cc:462] loading: preprocessing:1
I1121 06:29:46.311921 6861 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1121 06:29:46.311932 6861 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1121 06:29:46.774831 6860 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1121 06:29:46.775212 6860 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1121 06:29:46.776830 6859 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1121 06:29:46.776925 6862 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
I1121 06:29:46.777106 6859 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1121 06:29:46.777396 6862 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
tokenizer: 2 1 2
I1121 06:29:47.896099 6861 model_lifecycle.cc:819] successfully loaded 'postprocessing'
tokenizer: 2 1 2
I1121 06:29:47.994187 6862 model_lifecycle.cc:819] successfully loaded 'postprocessing'
tokenizer: 2 1 2
tokenizer: 2 1 2
I1121 06:29:47.999332 6860 model_lifecycle.cc:819] successfully loaded 'postprocessing'
I1121 06:29:47.999890 6859 model_lifecycle.cc:819] successfully loaded 'postprocessing'
tokenizer: 2 1 2
tokenizer: 2 1 2
I1121 06:29:49.728416 6860 model_lifecycle.cc:819] successfully loaded 'preprocessing'
I1121 06:29:49.728703 6861 model_lifecycle.cc:819] successfully loaded 'preprocessing'
tokenizer: 2 1 2
I1121 06:29:49.732066 6859 model_lifecycle.cc:819] successfully loaded 'preprocessing'
tokenizer: 2 1 2
I1121 06:29:49.740389 6862 model_lifecycle.cc:819] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 32953 MiB
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 32953 MiB
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 32953 MiB
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 32953 MiB
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33240, GPU 34890 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33241, GPU 34900 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33240, GPU 34890 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33241, GPU 34900 (MiB)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Bootstrap : Using eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO cudaDriverVersion 12020
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Bootstrap : Using eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33240, GPU 34890 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33241, GPU 34900 (MiB)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO cudaDriverVersion 12020
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Bootstrap : Using eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO P2P plugin IBext
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NET/Socket : Using [0]eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Using non-device net plugin version 0
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Using network Socket
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO P2P plugin IBext
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NET/Socket : Using [0]eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Using non-device net plugin version 0
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Using network Socket
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO P2P plugin IBext
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NET/Socket : Using [0]eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Using non-device net plugin version 0
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Using network Socket
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33240, GPU 34890 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33241, GPU 34900 (MiB)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO cudaDriverVersion 12020
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Bootstrap : Using eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO P2P plugin IBext
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NET/IB : No device found.
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NET/Socket : Using [0]eth0:10.156.28.90<0>
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Using non-device net plugin version 0
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Using network Socket
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO comm 0x7f52f16c7000 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId ca000 commId 0x5de02ca6db1d80d3 - Init START
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO comm 0x7f7ad56c9b60 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e3000 commId 0x5de02ca6db1d80d3 - Init START
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO comm 0x7f19ed6c9ed0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 65000 commId 0x5de02ca6db1d80d3 - Init START
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO comm 0x7f298d6ca730 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4b000 commId 0x5de02ca6db1d80d3 - Init START
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Setting affinity for GPU 2 to ffffff00,0000ffff,ff000000
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO NVLS multicast support is not available on dev 2
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO NVLS multicast support is not available on dev 1
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO NVLS multicast support is not available on dev 0
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Setting affinity for GPU 3 to ffffff00,0000ffff,ff000000
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO NVLS multicast support is not available on dev 3
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO P2P Chunksize set to 131072
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO P2P Chunksize set to 131072
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Channel 00/02 :    0   1   2   3
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Channel 01/02 :    0   1   2   3
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO P2P Chunksize set to 131072
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO P2P Chunksize set to 131072
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Channel 00 : 3[3] -> 0[0] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Channel 00 : 2[2] -> 3[3] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Channel 01 : 2[2] -> 3[3] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Connected all rings
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Connected all rings
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Connected all rings
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Connected all rings
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Channel 00 : 3[3] -> 2[2] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Channel 01 : 3[3] -> 2[2] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO Connected all trees
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO Connected all trees
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO Connected all trees
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO Connected all trees
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6861:6979 [2] NCCL INFO comm 0x7f52f16c7000 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId ca000 commId 0x5de02ca6db1d80d3 - Init COMPLETE
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6859:6969 [0] NCCL INFO comm 0x7f298d6ca730 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4b000 commId 0x5de02ca6db1d80d3 - Init COMPLETE
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6860:6967 [1] NCCL INFO comm 0x7f19ed6c9ed0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 65000 commId 0x5de02ca6db1d80d3 - Init COMPLETE
archimedes-open-api-70b-1325343-5875b79c44-lzlpt:6862:6975 [3] NCCL INFO comm 0x7f7ad56c9b60 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e3000 commId 0x5de02ca6db1d80d3 - Init COMPLETE
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +32950, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +32950, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +32950, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +32950, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33399, GPU 36206 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33399, GPU 36206 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33399, GPU 36206 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33399, GPU 36206 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33399, GPU 36214 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33399, GPU 36214 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33399, GPU 36214 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33399, GPU 36214 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33402, GPU 36224 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33402, GPU 36224 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33403, GPU 36234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33402, GPU 36224 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33402, GPU 36224 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33403, GPU 36234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33403, GPU 36234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 33403, GPU 36234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 32950 (MiB)
[TensorRT-LLM][INFO] Using 486987 tokens in paged KV cache.
[TensorRT-LLM][INFO] Using 486987 tokens in paged KV cache.
[TensorRT-LLM][INFO] Using 486987 tokens in paged KV cache.
[TensorRT-LLM][INFO] Using 486987 tokens in paged KV cache.
I1121 06:31:38.402380 6859 model_lifecycle.cc:819] successfully loaded 'tensorrt_llm'
I1121 06:31:38.402859 6859 model_lifecycle.cc:462] loading: ensemble:1
I1121 06:31:38.403217 6859 model_lifecycle.cc:819] successfully loaded 'ensemble'
I1121 06:31:38.403287 6859 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1121 06:31:38.403335 6859 server.cc:631] 
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                                                             |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}                                                                                                                                                                                                 |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}                                     |
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I1121 06:31:38.403363 6859 server.cc:674] 
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+
I1121 06:31:38.471530 6859 metrics.cc:810] Collecting metrics for GPU 0: NVIDIA A100 80GB PCIe
I1121 06:31:38.471559 6859 metrics.cc:810] Collecting metrics for GPU 1: NVIDIA A100 80GB PCIe
I1121 06:31:38.471566 6859 metrics.cc:810] Collecting metrics for GPU 2: NVIDIA A100 80GB PCIe
I1121 06:31:38.471573 6859 metrics.cc:810] Collecting metrics for GPU 3: NVIDIA A100 80GB PCIe
I1121 06:31:38.472091 6859 metrics.cc:703] Collecting CPU metrics
I1121 06:31:38.472250 6859 tritonserver.cc:2435] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.37.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /all_models/inflight_batcher_llm                                                                                                                                                                                |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 1                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I1121 06:31:38.483720 6859 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I1121 06:31:38.484149 6859 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I1121 06:31:38.525310 6859 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
open-api-70b-1325343-5875b79c44-lzlpt:6861:10158 [2] misc/strongstream.cc:343 NCCL WARN Cuda failure 'an illegal memory access was encountered'
open-api-70b-1325343-5875b79c44-lzlpt:6861:10158 [2] NCCL INFO enqueue.cc:1120 -> 1
open-api-70b-1325343-5875b79c44-lzlpt:6862:10157 [3] misc/strongstream.cc:343 NCCL WARN Cuda failure 'an illegal memory access was encountered'
open-api-70b-1325343-5875b79c44-lzlpt:6862:10157 [3] NCCL INFO enqueue.cc:1120 -> 1
open-api-70b-1325343-5875b79c44-lzlpt:6860:10156 [1] misc/strongstream.cc:343 NCCL WARN Cuda failure 'an illegal memory access was encountered'
open-api-70b-1325343-5875b79c44-lzlpt:6860:10156 [1] NCCL INFO enqueue.cc:1120 -> 1
open-api-70b-1325343-5875b79c44-lzlpt:6859:10107 [0] misc/strongstream.cc:343 NCCL WARN Cuda failure 'an illegal memory access was encountered'
open-api-70b-1325343-5875b79c44-lzlpt:6859:10107 [0] NCCL INFO enqueue.cc:1120 -> 1
[TensorRT-LLM][ERROR] 1: [pointWiseV2Helpers.cpp::launchPwgenKernel::267] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[TensorRT-LLM][ERROR] 1: [pointWiseV2Helpers.cpp::launchPwgenKernel::267] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][ERROR] Encountered error for requestId 1476600348: Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][ERROR] Encountered error for requestId 787608832: Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][ERROR] Encountered error for requestId 853225507: Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] 1: [pointWiseV2Helpers.cpp::launchPwgenKernel::267] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] 1: [pointWiseV2Helpers.cpp::launchPwgenKernel::267] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1       0x7f7ac73f7695 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x35695) [0x7f7ac73f7695]
2       0x7f7ac745b5d3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x995d3) [0x7f7ac745b5d3]
3       0x7f7ac7423415 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x61415) [0x7f7ac7423415]
4       0x7f7ac7413aa1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51aa1) [0x7f7ac7413aa1]
5       0x7f7ac7415902 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53902) [0x7f7ac7415902]
6       0x7f7c63272253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f7c63272253]
7       0x7f7c63002b43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f7c63002b43]
8       0x7f7c63094a00 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f7c63094a00]
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1       0x7f297f3f7695 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x35695) [0x7f297f3f7695]
2       0x7f297f45b5d3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x995d3) [0x7f297f45b5d3]
3       0x7f297f423415 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x61415) [0x7f297f423415]
4       0x7f297f413aa1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51aa1) [0x7f297f413aa1]
5       0x7f297f415902 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53902) [0x7f297f415902]
6       0x7f2b1dc72253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2b1dc72253]
7       0x7f2b1da02b43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f2b1da02b43]
8       0x7f2b1da94a00 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f2b1da94a00]
0xymoro commented 9 months ago

Got it here as well, also TP4 70B; I'm able to reproduce it with large parallel batch requests. https://github.com/NVIDIA/TensorRT-LLM/issues/427

@BasicCoder are you using quantization? I'm using FP8, but it seems the issue doesn't have to do with the quantization.

@byshiue I think you're right that it may be easier to debug on a smaller model, but perhaps the issue is the tensor parallelism, and if it is, it might not be possible to work around. I wonder whether it's specific to Llama 70B or to something in how MQA works. I can still send it over to you.

byshiue commented 9 months ago

@BasicCoder @0xymoro Does this issue happen on python runtime?

BasicCoder commented 9 months ago

@BasicCoder @0xymoro Does this issue happen on python runtime?

Sorry, I did not try the Python runtime. This error appeared after the service had been running stably for a while. After I turned off all possible optimization options and kept only these options:

python build.py --world_size 4 \
--tp_size 4 \
--model_dir /70B_models \
--dtype float16 \
--max_batch_size 4 \
--max_input_len 4096 \
--max_output_len 4096 \
--max_beam_width 1 \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_rmsnorm_plugin float16 \
--use_inflight_batching \
--paged_kv_cache \
--remove_input_padding \
--output_dir /70B_models/trt_engines/fp16/4-gpu/ 

a new error appeared:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_EXECUTION_FAILED (/app/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:140)
1       0x7f4cf6b89a6e /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0x8ea6e) [0x7f4cf6b89a6e]
2       0x7f4cf6bce7c5 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xd37c5) [0x7f4cf6bce7c5]
3       0x7f4cf6bceb7b /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xd3b7b) [0x7f4cf6bceb7b]
4       0x7f4cf6ba7e81 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xace81) [0x7f4cf6ba7e81]
5       0x7f4cf6ba8767 tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 263
6       0x7f4d34afbfc9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10aefc9) [0x7f4d34afbfc9]
7       0x7f4d34abee04 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1071e04) [0x7f4d34abee04]
8       0x7f4d34ac09a0 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10739a0) [0x7f4d34ac09a0]
9       0x7f4d4341ec37 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5cc37) [0x7f4d4341ec37]
10      0x7f4d434203ed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5e3ed) [0x7f4d434203ed]
11      0x7f4d4342312d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6112d) [0x7f4d4342312d]
12      0x7f4d43413aa1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51aa1) [0x7f4d43413aa1]
13      0x7f4d43415902 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53902) [0x7f4d43415902]
14      0x7f4ee0e72253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f4ee0e72253]
15      0x7f4ee0c02b43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f4ee0c02b43]
16      0x7f4ee0c94a00 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f4ee0c94a00]
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_EXECUTION_FAILED (/app/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:140)
1       0x7f9a26b89a6e /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0x8ea6e) [0x7f9a26b89a6e]
2       0x7f9a26bce7c5 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xd37c5) [0x7f9a26bce7c5]
3       0x7f9a26bceb7b /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xd3b7b) [0x7f9a26bceb7b]
4       0x7f9a26ba7e81 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xace81) [0x7f9a26ba7e81]
5       0x7f9a26ba8767 tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 263
6       0x7f9a64afbfc9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10aefc9) [0x7f9a64afbfc9]
7       0x7f9a64abee04 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1071e04) [0x7f9a64abee04]
8       0x7f9a64ac09a0 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10739a0) [0x7f9a64ac09a0]
9       0x7f9a7341ec37 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5cc37) [0x7f9a7341ec37]
10      0x7f9a734203ed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5e3ed) [0x7f9a734203ed]
11      0x7f9a7342312d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6112d) [0x7f9a7342312d]
12      0x7f9a73413aa1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51aa1) [0x7f9a73413aa1]
13      0x7f9a73415902 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53902) [0x7f9a73415902]
14      0x7f9c0f872253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9c0f872253]
15      0x7f9c0f602b43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f9c0f602b43]
16      0x7f9c0f694a00 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f9c0f694a00]
Signal (6) received.
[archimedes-open-api-70b-1325343-5875b79c44-m6l66:06554] *** Process received signal ***
[archimedes-open-api-70b-1325343-5875b79c44-m6l66:06554] Signal: Aborted (6)
[archimedes-open-api-70b-1325343-5875b79c44-m6l66:06554] Signal code:  (-6)

This new error also appeared after the service had been running stably for some time.

BasicCoder commented 9 months ago

After nearly two weeks of testing, the most stable engine construction method is:

python build.py --world_size 4 \
--tp_size 4 \
--model_dir /70B_models \
--dtype float16 \
--max_batch_size 4 \
--max_input_len 4096 \
--max_output_len 4096 \
--max_beam_width 1 \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_inflight_batching \
--paged_kv_cache \
--remove_input_padding \
--output_dir /70B_models/trt_engines/fp16/4-gpu/ 

This error may be solved by setting kv_cache_free_gpu_mem_fraction=0.80 to reserve more buffer space (the default is 0.85).
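
For reference, a minimal sketch of where that setting lives in this Triton deployment: the tensorrt_llm backend reads it from the model's config.pbtxt (the parameter name matches the "kv_cache_free_gpu_mem_fraction is not specified" warning in the log above; the exact path below is an assumption based on the default inflight_batcher_llm layout):

# /all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (assumed path)
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.80"
  }
}

After editing the config and restarting launch_triton_server.py, the "will use default value of 0.85" warning should no longer appear.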

0xymoro commented 8 months ago

Got some free time and went back and reran some older tests to see if kv_cache_free_gpu_mem_fraction fixed it. It did not. I manually set it to 0.65 and below, and it still errored out when stress-tested with long context and many requests in parallel at a time.

The only difference between my engine that failed and @BasicCoder's, which they said was stable, is that I have --enable_context_fmha. It might be a pretty good bet to isolate the issue on this flag, i.e. what the flash attention path is doing wrong here, in your investigation.

My build args that resulted in failures:

--dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --remove_input_padding --use_inflight_batching --paged_kv_cache --enable_context_fmha --max_input_len 4096 --max_batch_size 128 --strongly_typed --enable_fp8 --fp8_kv_cache --world_size 4 --tp_size 4 --parallel_build

FP8 doesn't seem to be the issue, given the error happened the same way with and without FP8. Please keep in mind the version is still a month old, but I did not see anything in the newer updates suggesting this issue has been fixed, so I just used the old engine and the TRT-LLM/Triton images I had built a month ago, based on https://github.com/NVIDIA/TensorRT-LLM/commit/6755a3f077bc39d51b74bbcb50403f54663cc8dc

@byshiue @juney-nvidia @jdemouth-nvidia

0xymoro commented 8 months ago

Update on this. I tried TP2 and no longer get any "illegal memory access" errors. However, when end_id = 2 is supplied from the client side, the system freezes indefinitely at high context + large batch sizes (20).

Removing end_id made it work; I tried 500 requests with 20 in parallel and none of them froze.

I can open a separate issue for end_id (it may also have to do with the older version I built the engine with), but I can confirm TP2 doesn't show the same behavior as TP4 in terms of memory access errors (at least not explicitly).
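
For context, a minimal client-side sketch of the kind of request described above, assuming the default inflight_batcher_llm ensemble input/output names (text_input, max_tokens, end_id, text_output); this is an illustration of where end_id comes in, not the exact client used here:

import numpy as np
import tritonclient.http as httpclient

# Connect to the HTTP endpoint Triton exposes on port 8000 (see the launch log above).
client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["Write a short poem about GPUs."]], dtype=object)
max_tokens = np.array([[512]], dtype=np.int32)
end_id = np.array([[2]], dtype=np.int32)  # supplying end_id=2 is what coincided with the freeze

inputs = [
    httpclient.InferInput("text_input", list(text.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
    httpclient.InferInput("end_id", list(end_id.shape), "INT32"),  # drop this input for the "no freeze" case
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)
inputs[2].set_data_from_numpy(end_id)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))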