ydm-amazon opened this issue 5 days ago
@ydm-amazon Which convert_checkpoint.py do you use?
@hello-11 Thanks for the quick response - I use the example one in examples/enc_dec.
Some more context: we specifically noticed that GPU memory usage is about 2x higher with v12 compared to v11. I believe @ydm-amazon is observing something similar for v13, but I'll check and confirm. I reduced the batch size from 256 to 128, and with that we can successfully deploy, but memory usage is much higher.
For the v11 vs v12 difference, I'm not sure whether this behavior change is expected or whether we are missing a configuration.
In v12, these are the commands we used to build the engines:
# encoder
trtllm-build \
--checkpoint_dir /tmp/trtllm_t5_ckpt/encoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/encoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache disable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 256 \
--max_input_len 1024 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable
# decoder
trtllm-build \
--checkpoint_dir /tmp/trtllm_t5_ckpt/decoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/decoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache enable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 128 \
--max_input_len 1 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable \
--max_encoder_input_len 1024 \
--max_seq_len 1024
In v11 we used:
# encoder
trtllm-build \
--tp_size 4 \
--pp_size 1 \
--checkpoint_dir /tmp/trtllm_t5_ckpt/encoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/encoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache disable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_custom_all_reduce disable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 256 \
--max_input_len 1024 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable
# decoder
trtllm-build \
--tp_size 4 \
--pp_size 1 \
--checkpoint_dir /tmp/trtllm_t5_ckpt/decoder/ \
--log_level info \
--gemm_plugin bfloat16 \
--output_dir /tmp/trtengine/google-flan-t5-xl/1/decoder \
--workers 4 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache enable \
--context_fmha disable \
--max_beam_width 1 \
--remove_input_padding enable \
--use_custom_all_reduce disable \
--use_paged_context_fmha disable \
--use_fp8_context_fmha disable \
--max_batch_size 256 \
--max_input_len 1 \
--max_num_tokens 16384 \
--enable_xqa disable \
--moe_plugin disable \
--max_encoder_input_len 1024 \
--max_seq_len 1024
In both cases, we're deploying with TP=4 on L4 GPUs.
v12 output from nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:38:00.0 Off | 0 |
| N/A 34C P0 27W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:3A:00.0 Off | 0 |
| N/A 30C P0 27W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:3C:00.0 Off | 0 |
| N/A 33C P0 26W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:3E:00.0 Off | 0 |
| N/A 32C P0 27W / 72W | 13426MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 133162 C python3 12650MiB |
| 0 N/A N/A 133163 C python3 248MiB |
| 0 N/A N/A 133164 C python3 248MiB |
| 0 N/A N/A 133165 C python3 248MiB |
| 1 N/A N/A 133162 C python3 248MiB |
| 1 N/A N/A 133163 C python3 12650MiB |
| 1 N/A N/A 133164 C python3 248MiB |
| 1 N/A N/A 133165 C python3 248MiB |
| 2 N/A N/A 133162 C python3 248MiB |
| 2 N/A N/A 133163 C python3 248MiB |
| 2 N/A N/A 133164 C python3 12650MiB |
| 2 N/A N/A 133165 C python3 248MiB |
| 3 N/A N/A 133162 C python3 248MiB |
| 3 N/A N/A 133163 C python3 248MiB |
| 3 N/A N/A 133164 C python3 248MiB |
| 3 N/A N/A 133165 C python3 12650MiB |
+-----------------------------------------------------------------------------------------+
v11 output from nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:38:00.0 Off | 0 |
| N/A 38C P0 27W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:3A:00.0 Off | 0 |
| N/A 35C P0 26W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:3C:00.0 Off | 0 |
| N/A 37C P0 26W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:3E:00.0 Off | 0 |
| N/A 37C P0 27W / 72W | 7086MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 151463 C python3 6314MiB |
| 0 N/A N/A 151464 C python3 248MiB |
| 0 N/A N/A 151465 C python3 248MiB |
| 0 N/A N/A 151466 C python3 248MiB |
| 1 N/A N/A 151463 C python3 248MiB |
| 1 N/A N/A 151464 C python3 6314MiB |
| 1 N/A N/A 151465 C python3 248MiB |
| 1 N/A N/A 151466 C python3 248MiB |
| 2 N/A N/A 151463 C python3 248MiB |
| 2 N/A N/A 151464 C python3 248MiB |
| 2 N/A N/A 151465 C python3 6314MiB |
| 2 N/A N/A 151466 C python3 248MiB |
| 3 N/A N/A 151463 C python3 248MiB |
| 3 N/A N/A 151464 C python3 248MiB |
| 3 N/A N/A 151465 C python3 248MiB |
| 3 N/A N/A 151466 C python3 6314MiB |
+-----------------------------------------------------------------------------------------+
Hi @ydm-amazon @siddvenk,
I cannot get access to 4x L4, but I tried with 4x A30 and cannot reproduce your error on either TRT-LLM v0.12 or the main branch. There is a small memory-size difference between L4 (23034MiB) and A30 (24576MiB), but this is the closest hardware I have.
Currently, GPU memory usage is determined by the --kv_cache_free_gpu_memory_fraction argument in run.py. The difference in memory usage might come from a design change in this feature. For enc-dec it's more complicated: the value is divided by 2 (half for the encoder, half for the decoder). See the code here for reference.
I'm assuming you're using the TensorRT-LLM backend for inference. Please also check the value of the kv_cache_free_gpu_memory setting in config.pbtxt.
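To make the halving concrete, here is a minimal sketch of the arithmetic (this is an illustration only, not TensorRT-LLM code, and the function name is made up):

# Illustration of the enc-dec split described above; not TensorRT-LLM code.
def effective_kv_cache_fractions(kv_cache_free_gpu_memory_fraction: float) -> dict:
    """Split the user-supplied fraction evenly between encoder and decoder."""
    half = kv_cache_free_gpu_memory_fraction / 2.0
    return {"encoder": half, "decoder": half}

# e.g. passing 0.9 means each engine effectively gets 0.45 of the free GPU memory
print(effective_kv_cache_fractions(0.9))  # {'encoder': 0.45, 'decoder': 0.45}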
I've attached my nvidia-smi running the engine.
Tue Nov 5 07:11:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A30 On | 00000000:81:00.0 Off | 0 |
| N/A 30C P0 30W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:A1:00.0 Off | 0 |
| N/A 31C P0 35W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:C1:00.0 Off | 0 |
| N/A 31C P0 32W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A30 On | 00000000:E1:00.0 Off | 0 |
| N/A 30C P0 31W / 165W | 23532MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2995 C python3 22828MiB |
| 0 N/A N/A 2996 C python3 224MiB |
| 0 N/A N/A 2997 C python3 224MiB |
| 0 N/A N/A 2998 C python3 224MiB |
| 1 N/A N/A 2995 C python3 224MiB |
| 1 N/A N/A 2996 C python3 22828MiB |
| 1 N/A N/A 2997 C python3 224MiB |
| 1 N/A N/A 2998 C python3 224MiB |
| 2 N/A N/A 2995 C python3 224MiB |
| 2 N/A N/A 2996 C python3 224MiB |
| 2 N/A N/A 2997 C python3 22828MiB |
| 2 N/A N/A 2998 C python3 224MiB |
| 3 N/A N/A 2995 C python3 224MiB |
| 3 N/A N/A 2996 C python3 224MiB |
| 3 N/A N/A 2997 C python3 224MiB |
| 3 N/A N/A 2998 C python3 22828MiB |
+-----------------------------------------------------------------------------------------+
Hi @jtchen0528,
Setting the kv_cache_free_gpu_memory_fraction does not seem to fix the issue; I have tried the values 0.9, 0.8, and 0.4, but they all give an OOM error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 22.19 GiB of which 1.50 MiB is free. Process 885415 has 22.18 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I have also double checked the config.pbtxt file in the artifacts to make sure it's set correctly.
Hi @ydm-amazon,
I see that you managed to reproduce the issue with TensorRT-LLM only. Let's focus on debugging TensorRT-LLM for now.
My steps: I checked out tags/v0.12.0 and used the examples/run.py script from there. Since examples/run.py does not stop after launching the executor, I put a time.sleep(60) before model.generate() to log the GPU allocation with nvidia-smi. I saw 23532MiB / 24576MiB on each GPU, since the allocation is based on the kv_cache_free_gpu_memory_fraction parameter. You should see a similar number on your L4s (23034MiB).
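For reference, the tweak is just a sleep inserted before the generate call; a minimal sketch (the runner object and its generate signature are whatever examples/run.py builds, so the names below are placeholders):

import time

def generate_with_pause(runner, batch_input_ids, pause_seconds=60, **gen_kwargs):
    """Sleep before generating so nvidia-smi can capture the post-load
    memory allocation of every rank."""
    time.sleep(pause_seconds)  # run nvidia-smi in another shell during this window
    return runner.generate(batch_input_ids, **gen_kwargs)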
Another possible cause: I wonder if your MPI settings are correct when allocating nodes. I see the following messages in the log you uploaded:
[TensorRT-LLM][WARNING] Device 3 peer access Device 0 is not available.
Try setting the GPU IDs with CUDA_VISIBLE_DEVICES=0,1,2,3 before running run.py, and run echo $CUDA_VISIBLE_DEVICES to check that the correct GPUs are visible.
Also check the number of tasks per node when allocating nodes.
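As an extra sanity check (a small sketch, assuming PyTorch is available inside the container), you can also print what each process actually sees before launching:

import os
import torch

# Print the GPUs visible to this process; with TP=4 on a single node,
# all four GPUs should be listed.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")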
I put the time.sleep(60) before model.generate(), and this is the nvidia-smi output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 20C P0 53W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 20C P0 56W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 21C P0 56W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 21C P0 56W / 300W | 22684MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 892976 C python3 22676MiB |
| 1 N/A N/A 892977 C python3 22676MiB |
| 2 N/A N/A 892978 C python3 22676MiB |
| 3 N/A N/A 892979 C python3 22676MiB |
+---------------------------------------------------------------------------------------+
It seems that there isn't enough space to hold the three other processes per GPU when generation starts. In your memory usage output it is 22828 MiB + 224 MiB × 3, but for mine it is 22676 MiB + 224 MiB × 3 = 23348 MiB, which is larger than the 23028 MiB the GPU can hold.
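Spelling out that comparison (values in MiB, taken from the per-process numbers discussed above):

# One "main" rank plus the three sibling ranks that each keep a small
# allocation on every GPU.
def fits_on_gpu(total, main_rank, sibling, n_siblings=3):
    needed = main_rank + n_siblings * sibling
    return needed, needed <= total

print(fits_on_gpu(24576, 22828, 224))  # (23500, True)  -> the 4x A30 case fits
print(fits_on_gpu(23028, 22676, 224))  # (23348, False) -> the 4x A10G case goes OOM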
For CUDA_VISIBLE_DEVICES, it seems that setting it does not prevent the "...peer access ... is not available" warning. Below is the log up to the point of the time.sleep(60), in case it helps:
root@9fcc5d585e64:/opt/ml/model# CUDA_VISIBLE_DEVICES=0,1,2,3
root@9fcc5d585e64:/opt/ml/model# echo $CUDA_VISIBLE_DEVICES
0,1,2,3
root@9fcc5d585e64:/opt/ml/model# mpirun --allow-run-as-root -np 4 python3 run.py --engine_dir /tmp/.djl.ai/trtllm/1d7f7fa222cc2d5726132bb27e06c64d65b384ec/google-flan-t5-xl/1/ --tokenizer_dir google/flan-t5-xl --max_output_len 64 --num_beams=1 --input_text "translate English to German: The house is wonderful."
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[11/06/2024-17:35:01] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1024 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] Device 3 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] Loaded engine size: 681 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16961.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 677 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 1024 from build config.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1024
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1024
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 16384
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 1 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Loaded engine size: 922 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 2 peerDevice: 0
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 0 peerDevice: 1
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 3 peerDevice: 0
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1328.64 MiB for execution context memory.
[TensorRT-LLM][INFO] cudaDeviceCanAccessPeer failed for device: 1 peerDevice: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1594 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.04 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 176.93 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.21 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.10 GiB for max tokens in paged KV cache (2112).
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 0.11 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.05 GiB for max tokens in paged KV cache (1152).
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.3786461353302 sec
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.377972364425659 sec
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.3568196296691895 sec
[11/06/2024-17:35:04] [TRT-LLM] [I] Load engine takes: 2.362577199935913 sec
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
When running Flan-T5-XL, I get the following error:
In previous versions (e.g. v0.11.0), this problem did not happen. But in v0.12.0, it does.
To reproduce:
Convert checkpoint:
Build engine:
Then runtime:
The error will be:
Expected behavior
Expected behavior is no OOM, as in v0.11.0. I tested v0.11.0 with similar commands and the error did not happen (the only differences in the commands follow the breaking changes of v0.12, where tp_size and pp_size are set during checkpoint conversion, multi-block mode is set at runtime, etc.). I checked that these settings were not the cause by also testing whether adding
--multi_block_mode true --enable_context_fmha_fp32_acc
to the run command would affect the result -- the OOM still happens.
actual behavior
Actual behavior is the OOM described above. Please see the log attached in the 'additional notes' section for more information.
additional notes
Full log when running TRTLLM