NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

trtllm-build is calling for excessive memory resulting in OOM error #1543

Closed: regexboi closed this issue 5 months ago

regexboi commented 5 months ago

System Info

Trying to build llama3-8b-instruct and mistral-instruct-0.2; both builds result in OOM errors, but the amount of memory being requested seems too large:

Llama

trtllm-build --checkpoint_dir /app/llama-trt --output_dir /app/llama-trt-engine
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
[05/05/2024-08:53:48] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set gemm_plugin to None.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set lookup_plugin to None.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set lora_plugin to None.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set moe_plugin to float16.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set context_fmha to True.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set remove_input_padding to True.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set multi_block_mode to False.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set enable_xqa to True.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set multiple_profiles to False.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set paged_state to True.
[05/05/2024-08:53:48] [TRT-LLM] [I] Set streamingllm to False.
[05/05/2024-08:53:48] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/05/2024-08:53:48] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[05/05/2024-08:53:49] [TRT] [I] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 391, GPU 72000 (MiB)
[05/05/2024-08:53:52] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1985, GPU +350, now: CPU 2512, GPU 72350 (MiB)
[05/05/2024-08:53:52] [TRT-LLM] [I] Set nccl_plugin to None.
[05/05/2024-08:53:52] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/05/2024-08:53:52] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
...
[05/05/2024-08:53:52] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[05/05/2024-08:53:52] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[05/05/2024-08:53:52] [TRT] [W] Unused Input: position_ids
[05/05/2024-08:53:52] [TRT] [W] Detected layernorm nodes in FP16.
[05/05/2024-08:53:52] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[05/05/2024-08:53:52] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[05/05/2024-08:53:52] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2561, GPU 72376 (MiB)
[05/05/2024-08:53:52] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2562, GPU 72386 (MiB)
[05/05/2024-08:53:52] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[05/05/2024-08:53:52] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/05/2024-08:59:54] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[05/05/2024-08:59:54] [TRT] [I] Detected 14 inputs and 1 output network tensors.
[05/05/2024-08:59:54] [TRT] [I] Total Host Persistent Memory: 26144
[05/05/2024-08:59:54] [TRT] [I] Total Device Persistent Memory: 0
[05/05/2024-08:59:54] [TRT] [I] Total Scratch Memory: 83886080
[05/05/2024-08:59:54] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 205 steps to complete.
[05/05/2024-08:59:54] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 8.30626ms to assign 17 blocks to 205 nodes requiring 121640960 bytes.
[05/05/2024-08:59:54] [TRT] [I] Total Activation Memory: 121639424
[05/05/2024-08:59:54] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[05/05/2024-08:59:54] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[05/05/2024-08:59:54] [TRT] [W] Requested amount of GPU memory (16065232896 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[05/05/2024-08:59:55] [TRT] [E] 2:
[05/05/2024-08:59:55] [TRT] [E] 2: [globWriter.cpp::makeResizableGpuMemory::423] Error Code 2: OutOfMemory (no further information)
[05/05/2024-08:59:55] [TRT-LLM] [E] Engine building failed, please check the error log.
[05/05/2024-08:59:55] [TRT] [I] Serialized 1859 bytes of code generator cache.
[05/05/2024-08:59:55] [TRT] [I] Serialized 625744 bytes of compilation cache.
[05/05/2024-08:59:55] [TRT] [I] Serialized 7 timing cache entries
[05/05/2024-08:59:55] [TRT-LLM] [I] Timing cache serialized to model.cache
[05/05/2024-08:59:55] [TRT-LLM] [I] Total time of building all engines: 00:06:06

Mistral

trtllm-build --checkpoint_dir /app/mistral-trt --output_dir /app/mistral-trt-engine
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
[05/05/2024-09:10:33] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set gemm_plugin to None.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set lookup_plugin to None.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set lora_plugin to None.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set moe_plugin to float16.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set context_fmha to True.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set remove_input_padding to True.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set multi_block_mode to False.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set enable_xqa to True.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set multiple_profiles to False.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set paged_state to True.
[05/05/2024-09:10:33] [TRT-LLM] [I] Set streamingllm to False.
[05/05/2024-09:10:33] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/05/2024-09:10:33] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[05/05/2024-09:10:34] [TRT] [I] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 1159, GPU 72000 (MiB)
[05/05/2024-09:10:37] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1985, GPU +350, now: CPU 3280, GPU 72350 (MiB)
[05/05/2024-09:10:37] [TRT-LLM] [I] Set nccl_plugin to None.
[05/05/2024-09:10:37] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/05/2024-09:10:37] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
...
[05/05/2024-09:10:37] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[05/05/2024-09:10:37] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[05/05/2024-09:10:37] [TRT] [W] Unused Input: position_ids
[05/05/2024-09:10:37] [TRT] [W] Detected layernorm nodes in FP16.
[05/05/2024-09:10:37] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[05/05/2024-09:10:37] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[05/05/2024-09:10:37] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3329, GPU 72376 (MiB)
[05/05/2024-09:10:37] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 3330, GPU 72386 (MiB)
[05/05/2024-09:10:37] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[05/05/2024-09:10:37] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/05/2024-09:15:37] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[05/05/2024-09:15:37] [TRT] [I] Detected 14 inputs and 1 output network tensors.
[05/05/2024-09:15:38] [TRT] [I] Total Host Persistent Memory: 26144
[05/05/2024-09:15:38] [TRT] [I] Total Device Persistent Memory: 0
[05/05/2024-09:15:38] [TRT] [I] Total Scratch Memory: 83886080
[05/05/2024-09:15:38] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 205 steps to complete.
[05/05/2024-09:15:38] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 8.5196ms to assign 17 blocks to 205 nodes requiring 121640960 bytes.
[05/05/2024-09:15:38] [TRT] [I] Total Activation Memory: 121639424
[05/05/2024-09:15:38] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[05/05/2024-09:15:38] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[05/05/2024-09:15:38] [TRT] [W] Requested amount of GPU memory (14500757504 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[05/05/2024-09:15:38] [TRT] [E] 2:
[05/05/2024-09:15:38] [TRT] [E] 2: [globWriter.cpp::makeResizableGpuMemory::423] Error Code 2: OutOfMemory (no further information)
[05/05/2024-09:15:38] [TRT-LLM] [E] Engine building failed, please check the error log.
[05/05/2024-09:15:38] [TRT] [I] Serialized 1919 bytes of code generator cache.
[05/05/2024-09:15:39] [TRT] [I] Serialized 696621 bytes of compilation cache.
[05/05/2024-09:15:39] [TRT] [I] Serialized 7 timing cache entries
[05/05/2024-09:15:39] [TRT-LLM] [I] Timing cache serialized to model.cache
[05/05/2024-09:15:39] [TRT-LLM] [I] Total time of building all engines: 00:05:05


Reproduction

docker run --gpus 1 --mount type=bind,source=/home/user/.cache/huggingface/hub,target=/app/model,bind-propagation=rshared -it trt-llm

cd TensorRT-LLM/examples/llama/

python3 convert_checkpoint.py --model_dir /app/model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/41b61a33a2483885c981aa79e0df6b32407ed873/ --output_dir /app/mistral-trt --dtype float16 --load_by_shard

trtllm-build --checkpoint_dir /app/mistral-trt --output_dir /app/mistral-trt-engine
# dockerfile for trt-llm
# Use the NVIDIA CUDA base image with development tools installed
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Set the working directory
WORKDIR /app

# Update and install necessary packages
RUN apt-get update && \
    apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git && \
    rm -rf /var/lib/apt/lists/*

# Install tensorrt_llm
RUN pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

# Clone the TensorRT-LLM repository and install dependencies
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && \
    cd TensorRT-LLM/examples/llama && \
    pip3 install -r requirements.txt

# Set the entrypoint to /bin/bash to match your setup
ENTRYPOINT ["/bin/bash"]
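
For reference, a minimal sketch of building this image, assuming the Dockerfile above is saved in the current directory (the trt-llm tag matches the docker run command in the reproduction steps):

# Build the image used by the docker run command above
docker build -t trt-llm .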

Expected behavior

Successful build

Actual behavior

Out of memory error

Additional notes

Could Docker be introducing the issue? Or perhaps it's my attempt to use all default build options; I looked through the docs and it seemed like the defaults would be best, other than --max_input_len, but I tried setting that to 4096 for llama and it didn't change the memory allocation at all.
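
For reference, the build log warns that max_num_tokens defaults to max_batch_size*max_input_len when remove_input_padding is enabled, and recommends setting it explicitly. A minimal sketch of a build that does so is below; the flag values are illustrative assumptions, not measured recommendations:

# Sketch: set explicit shape limits so the builder plans smaller activation buffers.
# The values below are illustrative; tune them to your actual workload.
trtllm-build --checkpoint_dir /app/llama-trt \
             --output_dir /app/llama-trt-engine \
             --max_batch_size 8 \
             --max_input_len 4096 \
             --max_num_tokens 8192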

byshiue commented 5 months ago

Could you try larger --shm-size when you launch the docker? Like --shm-size 25g.
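
For example, adapting the docker run command from the reproduction steps above (only the --shm-size flag is new; everything else is unchanged):

docker run --gpus 1 --shm-size 25g --mount type=bind,source=/home/user/.cache/huggingface/hub,target=/app/model,bind-propagation=rshared -it trt-llm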

regexboi commented 5 months ago

Amazing, that worked, thank you so much! I used --shm-size 25g as recommended above.

teis-e commented 5 months ago

@regexboi could you share the checkpoint command you used?

regexboi commented 5 months ago

> @regexboi could you share the checkpoint command you used?

Sure, here are all the commands I used; these are the final working ones:

python3 convert_checkpoint.py --model_dir /app/model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/41b61a33a2483885c981aa79e0df6b32407ed873/ --output_dir /app/model/mistral-trt --dtype float16 --load_by_shard

trtllm-build --checkpoint_dir /app/model/mistral-trt --output_dir /app/model/mistral-trt-engine --gpt_attention_plugin float16 --gemm_plugin float16 --max_input_len 32256
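
If it helps anyone, a quick sanity check of the built engine can be done with the repository's example runner. This is a sketch assuming the examples/run.py script from the cloned repo and the tokenizer from the original snapshot directory:

# Run from TensorRT-LLM/examples/llama inside the container (paths assumed)
python3 ../run.py --engine_dir /app/model/mistral-trt-engine \
    --tokenizer_dir /app/model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/41b61a33a2483885c981aa79e0df6b32407ed873/ \
    --max_output_len 64 \
    --input_text "What is the capital of France?"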