Hi,
I'm testing the llama3-70b model with SmoothQuant on a node with 4 x RTX-4090 GPUs. Due to the memory restriction, I used the `host_cache_size` parameter to offload the KV cache to the host. Then I hit 2 issues:

1. From the logs, it seems this config doesn't take effect.
Log snippet:
2. In this situation, when I keep pushing inference requests, the service crashes after a little while.
Crash message:
Could you help check this? The scripts for converting, building, and serving are listed below.
Model convert:
Model build:
LLM instance:
TRT-LLM full logs:
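For reference, the host offload is configured roughly like this (a minimal sketch rather than my exact script; import paths and `KvCacheConfig` field names vary between TensorRT-LLM releases, and the engine path and cache size below are placeholders):

```python
# Minimal sketch, not the exact serving script: load a prebuilt engine with
# KV-cache host offload enabled via host_cache_size.
# NOTE: import paths differ across TensorRT-LLM releases (llmapi vs. hlapi),
# and host_cache_size is assumed to be given in bytes.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,        # reuse cached KV blocks across requests
    host_cache_size=32 * 1024**3,   # placeholder: ~32 GiB of host memory for offloaded blocks
)

llm = LLM(
    model="/path/to/llama3-70b-sq-engine",  # placeholder: built engine / checkpoint dir
    tensor_parallel_size=4,                 # 4 x RTX-4090
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),  # older releases may call this max_new_tokens
)
print(outputs[0].outputs[0].text)
```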