ydm-amazon opened this issue 5 days ago
@ydm-amazon Which convert_checkpoint.py do you use?
@hello-11 Thanks for the quick response - I use the example one in examples/enc_dec.
Some more context: we specifically noticed that GPU memory usage is about 2x higher with v12 compared to v11. I believe @ydm-amazon is observing something similar for v13, but I'll check and confirm. I reduced the batch size from 256 to 128, and with that we can successfully deploy, but memory usage is much higher.
For the v11 vs v12 difference, I'm not sure whether this behavior change is expected or whether we are missing a configuration.
In v12, these are the commands we used to build the engines:
# encoder
trtllm-build \
--checkpoint_dir /tmp/trtllm_t5_ckpt/encoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/encoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache disable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 256 \
--max_input_len 1024 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable
# decoder
trtllm-build \
--checkpoint_dir /tmp/trtllm_t5_ckpt/decoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/decoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache enable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 128 \
--max_input_len 1 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable \
--max_encoder_input_len 1024 \
--max_seq_len 1024
In v11 we used:
# encoder
trtllm-build \
--tp_size 4 \
--pp_size 1 \
--checkpoint_dir /tmp/trtllm_t5_ckpt/encoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/encoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache disable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_custom_all_reduce disable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 256 \
--max_input_len 1024 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable
# decoder
trtllm-build \
--tp_size 4 \
--pp_size 1 \
--checkpoint_dir /tmp/trtllm_t5_ckpt/decoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/decoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache enable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_custom_all_reduce disable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 256 \
--max_input_len 1 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable \
--max_encoder_input_len 1024 \
--max_seq_len 1024
In both cases, we're deploying with TP=4 on L4 GPUs.
v12 output from nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:38:00.0 Off | 0 |
| N/A 34C P0 27W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:3A:00.0 Off | 0 |
| N/A 30C P0 27W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:3C:00.0 Off | 0 |
| N/A 33C P0 26W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:3E:00.0 Off | 0 |
| N/A 32C P0 27W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 133162 C python3 12650MiB |
| 0 N/A N/A 133163 C python3 248MiB |
| 0 N/A N/A 133164 C python3 248MiB |
| 0 N/A N/A 133165 C python3 248MiB |
| 1 N/A N/A 133162 C python3 248MiB |
| 1 N/A N/A 133163 C python3 12650MiB |
| 1 N/A N/A 133164 C python3 248MiB |
| 1 N/A N/A 133165 C python3 248MiB |
| 2 N/A N/A 133162 C python3 248MiB |
| 2 N/A N/A 133163 C python3 248MiB |
| 2 N/A N/A 133164 C python3 12650MiB |
| 2 N/A N/A 133165 C python3 248MiB |
| 3 N/A N/A 133162 C python3 248MiB |
| 3 N/A N/A 133163 C python3 248MiB |
| 3 N/A N/A 133164 C python3 248MiB |
| 3 N/A N/A 133165 C python3 12650MiB |
+-----------------------------------------------------------------------------------------+
v11 output from nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:38:00.0 Off | 0 |
| N/A 38C P0 27W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:3A:00.0 Off | 0 |
| N/A 35C P0 26W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:3C:00.0 Off | 0 |
| N/A 37C P0 26W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:3E:00.0 Off | 0 |
| N/A 37C P0 27W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 151463 C python3 6314MiB |
| 0 N/A N/A 151464 C python3 248MiB |
| 0 N/A N/A 151465 C python3 248MiB |
| 0 N/A N/A 151466 C python3 248MiB |
| 1 N/A N/A 151463 C python3 248MiB |
| 1 N/A N/A 151464 C python3 6314MiB |
| 1 N/A N/A 151465 C python3 248MiB |
| 1 N/A N/A 151466 C python3 248MiB |
| 2 N/A N/A 151463 C python3 248MiB |
| 2 N/A N/A 151464 C python3 248MiB |
| 2 N/A N/A 151465 C python3 6314MiB |
| 2 N/A N/A 151466 C python3 248MiB |
| 3 N/A N/A 151463 C python3 248MiB |
| 3 N/A N/A 151464 C python3 248MiB |
| 3 N/A N/A 151465 C python3 248MiB |
| 3 N/A N/A 151466 C python3 6314MiB |
+-----------------------------------------------------------------------------------------+
Hi @ydm-amazon @siddvenk,
I cannot get access to 4x L4, but I tried with 4x A30 and cannot reproduce your error on either TRT-LLM v0.12 or the main branch. There is a small memory-size difference between L4 (23034MiB) and A30 (24576MiB), but this is the closest hardware I have.
Currently, GPU memory usage is determined by the --kv_cache_free_gpu_memory_fraction argument in run.py. The difference in memory usage might come from a design change in this feature. For enc-dec it's more complicated: the value is divided by 2 (half for the encoder, half for the decoder). See the code here for reference.
I'm assuming you're using the TensorRT-LLM backend for inference. Please also check the value of the kv_cache_free_gpu_memory setting in config.pbtxt.
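To make the halving concrete, here is a minimal sketch of the arithmetic (this is an illustration only, not TensorRT-LLM code, and the function name is made up):

# Illustration of the enc-dec split described above; not TensorRT-LLM code.
def effective_kv_cache_fractions(kv_cache_free_gpu_memory_fraction: float) -> dict:
    """Split the user-supplied fraction evenly between encoder and decoder."""
    half = kv_cache_free_gpu_memory_fraction / 2.0
    return {"encoder": half, "decoder": half}

# e.g. passing 0.9 means each engine effectively gets 0.45 of the free GPU memory
print(effective_kv_cache_fractions(0.9))  # {'encoder': 0.45, 'decoder': 0.45}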
I've attached my nvidia-smi running the engine.
Tue Nov 5 07:11:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A30 On | 00000000:81:00.0 Off | 0 |
| N/A 30C P0 30W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:A1:00.0 Off | 0 |
| N/A 31C P0 35W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:C1:00.0 Off | 0 |
| N/A 31C P0 32W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A30 On | 00000000:E1:00.0 Off | 0 |
| N/A 30C P0 31W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2995 C python3 22828MiB |
| 0 N/A N/A 2996 C python3 224MiB |
| 0 N/A N/A 2997 C python3 224MiB |
| 0 N/A N/A 2998 C python3 224MiB |
| 1 N/A N/A 2995 C python3 224MiB |
| 1 N/A N/A 2996 C python3 22828MiB |
| 1 N/A N/A 2997 C python3 224MiB |
| 1 N/A N/A 2998 C python3 224MiB |
| 2 N/A N/A 2995 C python3 224MiB |
| 2 N/A N/A 2996 C python3 224MiB |
| 2 N/A N/A 2997 C python3 22828MiB |
| 2 N/A N/A 2998 C python3 224MiB |
| 3 N/A N/A 2995 C python3 224MiB |
| 3 N/A N/A 2996 C python3 224MiB |
| 3 N/A N/A 2997 C python3 224MiB |
| 3 N/A N/A 2998 C python3 22828MiB |
+-----------------------------------------------------------------------------------------+
Hi @jtchen0528,
Setting the kv_cache_free_gpu_memory_fraction does not seem to fix the issue; I have tried the values 0.9, 0.8, and 0.4, but they all give an OOM error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 22.19 GiB of which 1.50 MiB is free. Process 885415 has 22.18 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I have also double checked the config.pbtxt file in the artifacts to make sure it's set correctly.
Hi @ydm-amazon,
I see that you managed to reproduce the issue with TensorRT-LLM only. Let's focus on debugging TensorRT-LLM for now.
My steps: I checked out tags/v0.12.0 and used the examples/run.py script from there. Since examples/run.py does not stop after launching the executor, I put a time.sleep(60) before model.generate() to log the GPU allocation with nvidia-smi. I saw 23532MiB / 24576MiB on each GPU, since the allocation is based on the kv_cache_free_gpu_memory_fraction parameter. You should see a similar number on your L4s (23034MiB).
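For reference, the tweak is just a sleep inserted before the generate call; a minimal sketch (the runner object and its generate signature are whatever examples/run.py builds, so the names below are placeholders):

import time

def generate_with_pause(runner, batch_input_ids, pause_seconds=60, **gen_kwargs):
    """Sleep before generating so nvidia-smi can capture the post-load
    memory allocation of every rank."""
    time.sleep(pause_seconds)  # run nvidia-smi in another shell during this window
    return runner.generate(batch_input_ids, **gen_kwargs)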
Another possible cause: I wonder if your MPI settings are correct when allocating nodes. I see the following messages in the log you uploaded:
[TensorRT-LLM][WARNING] Device 3 peer access Device 0 is not available.
Try setting the GPU IDs with CUDA_VISIBLE_DEVICES=0,1,2,3 before running run.py, and run echo $CUDA_VISIBLE_DEVICES to check that the correct GPUs are visible.
Also check the number of tasks per node when allocating nodes.
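As an extra sanity check (a small sketch, assuming PyTorch is available inside the container), you can also print what each process actually sees before launching:

import os
import torch

# Print the GPUs visible to this process; with TP=4 on a single node,
# all four GPUs should be listed.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")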
I put the time.sleep(60) before model.generate(), and this is the nvidia-smi output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 20C P0 53W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 20C P0 56W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 21C P0 56W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 21C P0 56W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 892976 C python3 22676MiB |
| 1 N/A N/A 892977 C python3 22676MiB |
| 2 N/A N/A 892978 C python3 22676MiB |
| 3 N/A N/A 892979 C python3 22676MiB |
+---------------------------------------------------------------------------------------+
It seems that there isn't enough space to hold the three other processes per GPU when generation starts. In your memory usage output it is 22828 MiB + 224 MiB × 3, but for mine it is 22676 MiB + 224 MiB × 3 = 23348 MiB, which is larger than the 23028 MiB the GPU can hold.
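Spelling out that comparison (values in MiB, taken from the per-process numbers discussed above):

# One "main" rank plus the three sibling ranks that each keep a small
# allocation on every GPU.
def fits_on_gpu(total, main_rank, sibling, n_siblings=3):
    needed = main_rank + n_siblings * sibling
    return needed, needed <= total

print(fits_on_gpu(24576, 22828, 224))  # (23500, True)  -> the 4x A30 case fits
print(fits_on_gpu(23028, 22676, 224))  # (23348, False) -> the 4x A10G case goes OOM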
For CUDA_VISIBLE_DEVICES, it seems that setting it does not prevent the "...peer access ... is not available" warning. Below is the log up to the point of the time.sleep(60), in case it helps:
root@9fcc5d585e64:/opt/ml/model# CUDA_VISIBLE_DEVICES=0,1,2,3
root@9fcc5d585e64:/opt/ml/model# echo $CUDA_VISIBLE_DEVICES
0,1,2,3
root@9fcc5d585e64:/opt/ml/model# mpirun --allow-run-as-root -np 4 python3 run.py --engine_dir /tmp/.djl.ai/trtllm/1d7f7fa222cc2d5726132bb27e06c64d65b384ec/google-flan-t5-xl/1/ --tokenizer_dir google/flan-t5-xl --max_output_len 64 --num_beams=1 --input_text "translate English to German: The house is wonderful."
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] Device 3 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 2 peerDevice: 0
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 0 peerDevice: 1
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 3 peerDevice: 0
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 1 peerDevice: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.3786461353302 sec
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.377972364425659 sec
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.3568196296691895 sec
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.362577199935913 sec
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
When running Flan-T5-XL, I get the following error:
In previous versions (e.g. v0.11.0), this problem did not happen. But in v0.12.0, it does.
To reproduce:
Convert checkpoint:
Build engine:
Then runtime:
The error will be:
Expected behavior
Expected behavior is no OOM, as in v0.11.0. I tested v0.11.0 with similar commands and the error did not happen (the only differences in the commands follow the breaking changes of v0.12, where tp_size and pp_size are set during checkpoint conversion, multi-block mode is set at runtime, etc.). I checked that these settings were not the cause by also testing whether adding
--multi_block_mode true --enable_context_fmha_fp32_acc
to the run command would affect the result -- the OOM still happens.
actual behavior
Actual behavior is the OOM described above. Please see the log attached in the 'additional notes' section for more information.
additional notes
Full log when running TRTLLM