Closed atyshka closed 10 months ago
It appears to be a memory issue. While TRT saves the engine for rank 0, memory consumption is at about 67 GB. Then I get OOM when rank 1 starts and memory consumption reaches 128GB (on the CPU). This could be "expected behavior" but I think it's strange that none of the memory seems to be freed after building the first engine. It seems the weights are getting loaded twice, that might be a bug.
Thanks for reporting that issue @atyshka - it’s indeed suspicious that we use that much memory. We’ll investigate.
Thanks! For now I'll investigate if pre-quantizing to 8 bit will help
Same issue when building engines for a world_size greater than 1. BTW, it seems to build the engines for the different ranks serially and to use only a single GPU.
Could you try the load_from_binary function? It should avoid the CPU OOM issue.
@haojiwei build.py provides the --parallel_build flag for building TensorRT engines in parallel.
@wm2012011492 How can I use load_from_binary without quantization? The hf_llama_convert script doesn't seem to work without quantization:
AssertionError: Either INT8 kv cache or SmoothQuant must be enabled for this script. Otherwise you can directly build engines from HuggingFace checkpoints, no need to do this FT-format conversion.
You can refer here for the usage of load_from_binary if you want.
Yes, but my question is how to convert the LLaMA checkpoints (in either original or HF format) to the binary file format so that I can use load_from_binary. It's referred to as the "FT" format, which I assume refers to the previous FasterTransformer library. As far as I know, hf_llama_convert.py is the only way to convert checkpoints to binary files, but that script doesn't support conversion without quantization.
Thanks for reporting the issue.
I guess there might be a memory leak in weight loading. Can you please check whether this works? Revise the split method by adding .copy():

    def split(v, tp_size, idx, dim=0):
        if tp_size == 1:
            return v
        if len(v.shape) == 1:
            return np.ascontiguousarray(np.split(v, tp_size)[idx].copy())
        else:
            return np.ascontiguousarray(np.split(v, tp_size, axis=dim)[idx].copy())

In my test using an HF checkpoint, it alleviated the memory issue.
The reason is that numpy slices a tensor when splitting weights, and a sliced tensor shares the memory buffer of the original tensor. Returning the slice without copying therefore keeps the buffer of the loaded weight referenced, so the garbage collector can't release those weights until the whole process ends. There is a related discussion on Stack Overflow: https://stackoverflow.com/questions/50195197/reduce-memory-usage-when-slicing-numpy-arrays
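For anyone who wants to see the view-versus-copy behavior in isolation, here is a minimal sketch. The array shape is an arbitrary stand-in for a loaded weight, not taken from any real checkpoint. It also shows that np.ascontiguousarray does not copy when the slice is already contiguous, which would explain why the original split could keep the full buffer alive.

```python
import numpy as np

# Stand-in for a large loaded weight (shape chosen only for illustration).
full = np.zeros((4, 1024), dtype=np.float16)

# np.split returns views: the piece still points into `full`'s buffer.
view = np.split(full, 2, axis=0)[0]

# A row slice of a C-contiguous array is already contiguous, so
# np.ascontiguousarray hands it back without allocating a new buffer.
still_view = np.ascontiguousarray(view)

# .copy() allocates an independent buffer holding only the slice.
owned = np.split(full, 2, axis=0)[0].copy()

print(np.shares_memory(view, full))        # view keeps the parent alive
print(np.shares_memory(still_view, full))  # ascontiguousarray did not copy
print(np.shares_memory(owned, full))       # the copy is independent
print(owned.nbytes, full.nbytes)           # the copy holds half the bytes
```

Once only `owned` is kept and every reference to `full` is dropped, the large buffer becomes eligible for release.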
Thanks for the suggestion @jaedeok-nvidia. Can you share the details of your model and hardware setup? I was originally using Meta checkpoints, and unfortunately your copy() patch didn't seem to resolve the issue, it still runs out of memory when loading the weights the second time around. When I try to use huggingface checkpoints instead, I can't even finish building the first engine without running out of my 128GB of CPU memory.
I tested on a DGX A100 (which has a lot of host memory), so I measured the memory usage instead. 128GB may be too small to build an engine under the current workflow. I'm trying to revise the engine build workflow to reduce the host memory footprint.
In the meantime, can you please try the following? Load the HF model onto GPUs instead of the CPU. To do this, we may need to update the HF checkpoint loading like this:

    hf_llama = LlamaForCausalLM.from_pretrained(
        args.model_dir,
        device_map='auto',
        torch_dtype="auto")

There is currently still a memory leak in model loading, so this may help save some more host memory.
Unfortunately, it seems that it still exceeds 128GB of host memory during the engine build. If it fails, please try an explicit del engine after serialization (here).
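The "del engine after serialization" suggestion can be illustrated with a small, self-contained sketch. FakeEngine and serialize below are hypothetical stand-ins, not TensorRT-LLM APIs; the point is only the pattern of dropping the last reference and then forcing a collection before the next rank is built.

```python
import gc
import weakref


class FakeEngine:
    """Hypothetical stand-in for a built engine that owns a large host buffer."""

    def __init__(self, nbytes):
        self.blob = bytearray(nbytes)


def serialize(engine, sink):
    # Placeholder for writing the engine to disk; records only the size.
    sink.append(len(engine.blob))


sink = []
engine = FakeEngine(1 << 20)
probe = weakref.ref(engine)  # lets us observe when the object is freed

serialize(engine, sink)
del engine                   # drop the last strong reference
gc.collect()                 # also collect anything held in reference cycles

released = probe() is None
print(released)
```

If some other object still holds a reference to the engine (for example a builder or a list of results), del alone will not free it, which may be why this step did not help everyone in this thread.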
Loading baichuan-13B also consumes a lot of memory and the process gets killed.
Also try updating transformers - I found 4.34.1 allowed the model to at least serialize successfully.
Ok nope, managed to get one engine serialized out but every build after is back to OOM.
I'm using main branch commit: https://github.com/NVIDIA/TensorRT-LLM/commit/d8b408e6dcc1d45982a8b94399cd74b78f80befa
As it was noted to have improvements to help this problem in the discussion here: https://github.com/NVIDIA/TensorRT-LLM/discussions/153
I'm on 1xA100 (80GB) using https://huggingface.co/Phind/Phind-CodeLlama-34B-v2:
I've tried various derivatives of the following configuration as well as using weights only.
python ./tensorrt_llm/examples/llama/build.py \
--model_dir ./Phind/Phind-CodeLlama-34B-v2/ \
--output_dir ./Phind/Phind-CodeLlama-34B-v2-engine/ \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--rotary_base 1000000 --vocab_size 32000 --world_size 1 --tp_size 1 \
--enable_context_fmha \
--use_parallel_embedding \
--use_inflight_batching \
--max_input_len 1024 \
--max_output_len 1024 \
--max_batch_size 8 \
--parallel_build \
--paged_kv_cache
Swapping from CPU to GPU for the Model and Heads as suggested above gives the following error:
device_map='auto' as well as device_map={"model": "cuda","lm_head": "cuda"}
[TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/vocab_embedding/CONSTANT_0...LLaMAForCausalLM/layers/0/input_layernorm/ELEMENTWISE_PROD_0]}.
[TRT] [E] 10: [optimizer.cpp::computeCosts::4051] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/vocab_embedding/CONSTANT_0...LLaMAForCausalLM/layers/0/input_layernorm/ELEMENTWISE_PROD_0]}.)
Please let me know if there's any other information / examples I should try to help diagnose. Are the other patches from 4 days ago still relevant given the changes from the commit above?
Hi @lynkz-matt-psaltis , I tried to reproduce but it works well on my end. The error message says that there is an unsupported op in TRT side. Can you please share your testing env? like TensorRT version? An older TRT had a similar issue.
We've tested on TRT 9.1.0.4. If you need to upgrade TRT, please use the installation script docker/common/install_tensorrt.sh.
Thanks @jaedeok-nvidia
I'm building from main (commit 4de32a86ae92bc49a7ec17c00ec2f2d03663c198) on nvcr.io/nvidia/tritonserver:23.10-py3 pip list shows: tensorrt 9.1.0.post12.dev4
Was your repro environment a single A100 80G? ~200Gig of RAM
[11/02/2023-12:40:16] [TRT-LLM] [I] HF Baichuan v2_13b loaded. Total time: 00:04:33
[11/02/2023-12:40:16] [TRT-LLM] [I] Loading weights from HF Baichuan v2_13b...
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/baichuan/build.py", line 478, in <module>
build(0, args)
File "/app/tensorrt_llm/examples/baichuan/build.py", line 448, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/app/tensorrt_llm/examples/baichuan/build.py", line 348, in build_rank_engine
load_from_hf_baichuan(tensorrt_llm_baichuan,
File "/app/tensorrt_llm/examples/baichuan/weight.py", line 63, in load_from_hf_baichuan
v = torch_to_numpy(v.to(torch_dtype).detach().cpu())
NotImplementedError: Cannot copy out of meta tensor; no data!
https://github.com/NVIDIA/TensorRT-LLM/issues/229#issuecomment-1791008168
@jaedeok-nvidia if I add the del, it still consumes more than 128GB of memory, but I can manage to make it work by putting 10G on a swapfile.
I'm going to keep this open though because there's still quite a ways to go when it comes to memory efficiency. For my 34B model, the weights should take 68 GB in fp16 format. Obviously the actual engine building process is going to use significant memory, but I wouldn't expect it to be double the model size. Additionally, I think you should only need 1/n of the weights at a time for a config of n tensor parallel engines. So each of my two engines should in theory only need 34GB of weights.
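The arithmetic behind those numbers, written out as a small sketch. It counts only the raw weights at 2 bytes per fp16 parameter and ignores activations, builder workspace, and any temporary copies.

```python
def weight_gb(n_params, bytes_per_param):
    """Raw weight size in GB (10**9 bytes), ignoring all other buffers."""
    return n_params * bytes_per_param / 1e9


n_params = 34e9   # 34B-parameter model
fp16_bytes = 2    # bytes per fp16 parameter
tp_size = 2       # two tensor-parallel engines

total_fp16_gb = weight_gb(n_params, fp16_bytes)
per_rank_gb = total_fp16_gb / tp_size

print(total_fp16_gb)  # 68.0
print(per_rank_gb)    # 34.0
```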
Hi @atyshka, the issue is due to a memory leak in the trt-llm model. Specifically, the trt-llm model is not properly released after the build_rank_engine() method. We have almost identified the cause and I believe we will fix it soon.
Additionally, even when tp size > 1, we have to load the "full weights" in order to split the weights from a pretrained checkpoint. However, I agree that we should optimize the memory footprint further. For instance, we can reduce it by loading shard-by-shard instead of loading the full model, as the falcon example does.
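A rough sketch of the shard-by-shard idea, using plain NumPy files rather than any real checkpoint format. The shard layout, tensor names, and both helper functions are made up for illustration and are not how TensorRT-LLM or the falcon example actually store or load weights. The pattern is: open one shard, keep only this rank's slice as an owned copy, and let the full tensor go before opening the next shard.

```python
import os
import tempfile

import numpy as np


def write_fake_shards(directory, n_shards=3):
    """Write tiny .npz files that play the role of checkpoint shards."""
    paths = []
    for i in range(n_shards):
        path = os.path.join(directory, f"shard_{i}.npz")
        tensor = np.full((4, 8), i, dtype=np.float16)
        np.savez(path, **{f"layer{i}_weight": tensor})
        paths.append(path)
    return paths


def load_rank_weights(paths, tp_size, rank):
    """Load one shard at a time and keep only this rank's slice."""
    rank_weights = {}
    for path in paths:
        with np.load(path) as shard:
            for name in shard.files:
                full = shard[name]
                # .copy() so the kept slice does not pin the full tensor.
                rank_weights[name] = np.split(full, tp_size, axis=0)[rank].copy()
    return rank_weights


with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_fake_shards(tmp)
    weights = load_rank_weights(shard_paths, tp_size=2, rank=1)

for name in sorted(weights):
    print(name, weights[name].shape)
```

With this pattern the peak host memory is roughly one shard plus the slices kept so far, rather than the whole model.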
Hi @lynkz-matt-psaltis sorry for missing your reply. Hmm... it's weird. That issue is not about the memory usage. If you still have the issue, can you please create another issue to discuss it?
Loading a gpt2-13b model also consumes a lot of memory and the process gets killed. Script:
python3 build.py --model_dir=./c-model-13b/gpt2/1-gpu --use_gpt_attention_plugin=float16 --dtype=float16 --use_gemm_plugin=float16 --remove_input_padding
hardware:
              total        used        free      shared  buff/cache   available
Mem:          503Gi        68Gi        73Gi       1.9Gi       361Gi       412Gi
Low:          503Gi       429Gi        73Gi
---
GPU-v100 32510MiB
error log:
[11/07/2023-08:45:41] [TRT] [W] Requested amount of GPU memory (33555480576 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/07/2023-08:45:41] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 33555480576 detected for tactic 0x000000000000001e.
[11/07/2023-08:45:41] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[11/07/2023-08:45:41] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[11/07/2023-08:45:41] [TRT] [W] Requested amount of GPU memory (33555480576 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/07/2023-08:45:41] [TRT] [W] UNSUPPORTED_STATESkipping tactic 3 due to insufficient memory on requested size of 33555480576 detected for tactic 0x000000000000001f.
[11/07/2023-08:45:43] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[11/07/2023-08:45:43] [TRT] [I] Detected 49 inputs and 41 output network tensors.
[11/07/2023-08:46:10] [TRT] [I] Total Host Persistent Memory: 71248
[11/07/2023-08:46:10] [TRT] [I] Total Device Persistent Memory: 0
[11/07/2023-08:46:10] [TRT] [I] Total Scratch Memory: 4608992384
[11/07/2023-08:46:10] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 650 steps to complete.
[11/07/2023-08:46:10] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 35.8922ms to assign 11 blocks to 650 nodes requiring 8279011840 bytes.
[11/07/2023-08:46:10] [TRT] [I] Total Activation Memory: 8279011840
[11/07/2023-08:46:11] [TRT] [I] Total Weights Memory: 26245963592
[11/07/2023-08:46:11] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 26432, GPU 26871 (MiB)
[11/07/2023-08:46:11] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 26433, GPU 26881 (MiB)
[11/07/2023-08:46:11] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1065 MiB, GPU 25031 MiB
[11/07/2023-08:46:11] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +25031, now: CPU 0, GPU 25031 (MiB)
[11/07/2023-08:46:41] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 51556 MiB
[11/07/2023-08:46:41] [TRT-LLM] [I] Total time of building gpt_float16_tp1_rank0.engine: 00:01:47
[11/07/2023-08:46:41] [TRT-LLM] [I] Config saved to gpt_outputs/config.json.
[11/07/2023-08:46:41] [TRT-LLM] [I] Serializing engine to gpt_outputs/gpt_float16_tp1_rank0.engine...
Killed
I'd also like to chime in. Trying to build Llama2 70B engines in fp16 tp4 with less than 500GB of host memory goes OOM before it can finish rank 3. Ignore the red and yellow graphs; only the green (user) memory is relevant.
Here's the command I use on this RTX A6000 system:
python build.py --model_dir "$INDIR/hf-70B-chat/" \
--output_dir "$ENGINE_DIR" \
--remove_input_padding \
--max_input_len 4096 \
--max_output_len 4096 \
--max_batch_size 16 \
--dtype float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_inflight_batching \
--paged_kv_cache \
--world_size 4 \
--tp_size 4 \
--pp_size 1
The issue is less pronounced when using int8:
And here's what the process looks like on DGX A100:
More memory, so it doesn't crash, but using this amount to build is just silly.
Hi all, sorry for the inconvenience with the engine build.
Last week we updated the main branch to reduce the peak CPU memory footprint. Please use the --load_by_shard option with LLaMA / BLOOM / Falcon models to reduce the memory footprint. For the LLaMA model, --load_by_shard works for HF checkpoints only, so please convert to HF first if you have a Meta checkpoint (please refer to the guide for the checkpoint conversion).
There were several root causes of this issue: a memory leak at engine build time, and loading the full HF model. We have fixed the memory leaks at engine build time, and we now allow loading the weights shard-by-shard (rather than the full model), so the memory footprint of the engine build is now significantly reduced. Thanks for your valuable feedback.
load_from_binary
There is an error when use_weight_only is enabled.
My version is 0.6.1.
My build command is:
python /app/tensorrt_llm/examples/llama/build.py --model_dir /app/hf_model --dtype float16 --remove_input_padding --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --tp_size 2 --enable_context_fmha --use_inflight_batching --paged_kv_cache --load_by_shard --use_weight_only --weight_only_precision int8 --output_dir /app/triton_model/tensorrt_llm/1
The error is:
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/llama/build.py", line 839, in <module>
build(0, args)
File "/app/tensorrt_llm/examples/llama/build.py", line 783, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/app/tensorrt_llm/examples/llama/build.py", line 631, in build_rank_engine
load_from_hf_checkpoint(tensorrt_llm_llama,
File "/app/tensorrt_llm/examples/llama/weight.py", line 522, in load_from_hf_checkpoint
param = split_v.transpose()
TypeError: transpose() received an invalid combination of arguments - got (), but expected one of:
* (int dim0, int dim1)
* (name dim0, name dim1)
Solved by changing split_v.transpose() to split_v.transpose(0,1).contiguous().
But what can we do on the release 0.5.0 branch? I still hit this problem with branch 0.5.0 and 32GB of CPU memory.
Does the --load_by_shard option also support the qwen-72b model on trt-llm v0.7.0? I have two A6000s, and the single-node host memory is 62GB.
I am trying to run CodeLlama with the following setup:
Model size: 34B
GPUs: 2x A6000 (sm_86)
I'd like to run the model tensor-parallel across the two GPUs. Correct me if I'm wrong, but the "rank" refers to a particular GPU, and TensorRT builds a separate engine for each rank. It seems the engine builds successfully for rank 0 but not for rank 1. Here is my build command:
And it outputs a non-descriptive "Killed" message after building the engine for rank 0:
I'm using release-0.5 and the docker setup. Please let me know if there's any additional information that would help with debugging this.