NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

High CPU memory usage (Llama build Killed) #102

Closed atyshka closed 10 months ago

atyshka commented 11 months ago

I am trying to run CodeLlama with the following setup:

Model size: 34B GPUs: 2x A6000 (sm_86)

I'd like to run the model tensor-parallel across the two GPUs. Correct me if I'm wrong, but the "rank" refers to a particular GPU, and TensorRT builds a separate engine for each rank. The engine builds successfully for rank 0 but not for rank 1. Here is my build command:

python build.py --meta_ckpt_dir ../../models/CodeLlama-34b-Instruct/ \
    --dtype float16 --remove_input_padding \
    --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_rmsnorm_plugin float16 \
    --enable_context_fmha --output_dir codellama_34b \
    --rotary_base 1000000 --vocab_size 32000 --world_size 2 --tp_size 2

And it outputs a non-descriptive "Killed" message after building the engine for rank 0:

[10/24/2023-20:31:06] [TRT-LLM] [I] Serially build TensorRT engines.                                                                                                                                                                            
[10/24/2023-20:31:06] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 125, GPU 5521 (MiB)                                                                                                                                       
[10/24/2023-20:31:13] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2060, GPU 5833 (MiB)                                                                                                                
[10/24/2023-20:31:13] [TRT-LLM] [W] Invalid timing cache, using freshly created one                                                                                                                                                             
[10/24/2023-20:31:24] [TRT-LLM] [I] Loading weights from Meta LLaMA checkpoints ...   
[10/24/2023-20:32:26] [TRT-LLM] [I] Weights loaded. Total time: 00:01:01
[10/24/2023-20:32:27] [TRT-LLM] [I] Context FMHA Enabled                                                                                                                                                                                        
[10/24/2023-20:32:27] [TRT-LLM] [I] Remove Padding Enabled                                                                                                                                                                                      
[10/24/2023-20:32:27] [TRT-LLM] [I] Build TensorRT engine llama_float16_tp2_rank0.engine                                                                                                                                                        
[10/24/2023-20:32:27] [TRT] [W] Unused Input: position_ids                                                                                                                                                                                      
[10/24/2023-20:32:27] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.                                                                                                  
[10/24/2023-20:32:27] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 34548, GPU 7347 (MiB)                                                                                                                           
[10/24/2023-20:32:27] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 34550, GPU 7357 (MiB)                                                                                                                                    
[10/24/2023-20:32:27] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.     
[10/24/2023-20:32:50] [TRT] [W] Tactic Device request: 66048MB Available: 48676MB. Device memory is insufficient to use tactic.
[10/24/2023-20:32:50] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 66048 detected for tactic 0x000000000000001a.
[10/24/2023-20:32:54] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[10/24/2023-20:32:54] [TRT] [I] Detected 57 inputs and 51 output network tensors.
[10/24/2023-20:33:02] [TRT] [I] Total Host Persistent Memory: 147984
[10/24/2023-20:33:02] [TRT] [I] Total Device Persistent Memory: 0
[10/24/2023-20:33:02] [TRT] [I] Total Scratch Memory: 33620096
[10/24/2023-20:33:02] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 924 steps to complete.
[10/24/2023-20:33:02] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 30.3485ms to assign 11 blocks to 924 nodes requiring 1384123392 bytes.
[10/24/2023-20:33:02] [TRT] [I] Total Activation Memory: 1384123392
[10/24/2023-20:33:02] [TRT] [I] Total Weights Memory: 34006908952
[10/24/2023-20:33:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 35115, GPU 39801 (MiB)
[10/24/2023-20:33:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 35115, GPU 39811 (MiB)
[10/24/2023-20:33:02] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1000 MiB, GPU 32432 MiB
[10/24/2023-20:33:02] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +32432, now: CPU 0, GPU 32432 (MiB)
[10/24/2023-20:33:12] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 103590 MiB
[10/24/2023-20:33:12] [TRT-LLM] [I] Total time of building llama_float16_tp2_rank0.engine: 00:00:44
[10/24/2023-20:33:12] [TRT-LLM] [I] Config saved to codellama_34b/config.json.
[10/24/2023-20:33:12] [TRT-LLM] [I] Serializing engine to codellama_34b/llama_float16_tp2_rank0.engine...
[10/24/2023-20:33:55] [TRT-LLM] [I] Engine serialized. Total time: 00:00:42
[10/24/2023-20:34:05] [TRT-LLM] [I] Loading weights from Meta LLaMA checkpoints ...
Killed

I'm using release-0.5 and the docker setup. Please let me know if there's any additional information that would help with debugging this.

atyshka commented 11 months ago

It appears to be a memory issue. While TRT saves the engine for rank 0, memory consumption is at about 67 GB. Then I get an OOM when rank 1 starts and memory consumption reaches 128 GB (on the CPU). This could be "expected behavior", but I find it strange that none of the memory seems to be freed after building the first engine. It looks like the weights are being loaded twice, which might be a bug.
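
For reference, this is a minimal way to watch host memory from Python while the build runs (a sketch assuming psutil is installed; the helper name and call sites are illustrative, not part of build.py):

import os
import psutil

def log_host_memory(tag=""):
    # Resident set size of this process plus remaining system memory, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    avail_gb = psutil.virtual_memory().available / 1e9
    print(f"[mem] {tag}: rss={rss_gb:.1f} GB, available={avail_gb:.1f} GB")

# Call at points of interest, e.g. log_host_memory("after rank 0 serialized"),
# or poll it in a loop from a separate process while build.py runs.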

jdemouth-nvidia commented 11 months ago

Thanks for reporting that issue @atyshka - it’s indeed suspicious that we use that much memory. We’ll investigate.

atyshka commented 11 months ago

Thanks for reporting that issue @atyshka - it’s indeed suspicious that we use that much memory. We’ll investigate.

Thanks! For now I'll investigate if pre-quantizing to 8 bit will help

haojiwei commented 11 months ago

Same issue when building engines with a multi-GPU world_size. BTW, it seems the engines for the different ranks are built serially, utilizing just a single GPU.

wm2012011492 commented 11 months ago

Is it possible for you to try the load_from_binary function? It should avoid the CPU OOM issue.

@haojiwei build.py provides the flag --parallel_build for building the TensorRT engines in parallel.
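
For reference, adding that flag to the build command from the original report would look like this (a sketch; paths as in the report above; note that building all ranks at once may use more host memory at the same time):

python build.py --meta_ckpt_dir ../../models/CodeLlama-34b-Instruct/ \
    --dtype float16 --remove_input_padding \
    --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_rmsnorm_plugin float16 \
    --enable_context_fmha --output_dir codellama_34b \
    --rotary_base 1000000 --vocab_size 32000 \
    --world_size 2 --tp_size 2 --parallel_build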

atyshka commented 11 months ago

@wm2012011492 How can I use load_from_binary without quantization? The hf_llama_convert script doesn't seem to work without quantization: AssertionError: Either INT8 kv cache or SmoothQuant must be enabled for this script. Otherwise you can directly build engines from HuggingFace checkpoints, no need to do this FT-format conversion.

juney-nvidia commented 11 months ago

@wm2012011492 How can I use load_from_binary without quantization? The hf_llama_convert script doesn't seem to work without quantization: AssertionError: Either INT8 kv cache or SmoothQuant must be enabled for this script. Otherwise you can directly build engines from HuggingFace checkpoints, no need to do this FT-format conversion.

You can refer here for the usage of load_from_binary if you want.

atyshka commented 11 months ago

@wm2012011492 How can I use load_from_binary without quantization? The hf_llama_convert script doesn't seem to work without quantization: AssertionError: Either INT8 kv cache or SmoothQuant must be enabled for this script. Otherwise you can directly build engines from HuggingFace checkpoints, no need to do this FT-format conversion.

You can refer here for the usage of load_from_binary if you want.

Yes, but my question is how to convert the LLaMA checkpoints (in either original or HF format) to the binary file format needed for load_from_binary. It's referenced as "FT" format, which I assume refers to the previous FasterTransformer library. As far as I know, hf_llama_convert.py is the only way to convert checkpoints to binary files, but that script doesn't support conversion without quantization.

jaedeok-nvidia commented 11 months ago

Thanks for reporting the issue.

I suspect there might be a memory leak in weight loading. Can you please check whether this works? Revise the split method as follows:

import numpy as np

def split(v, tp_size, idx, dim=0):
    # Return the shard of `v` owned by rank `idx`; copy the slice so it does not
    # keep the full original array alive via a shared buffer.
    if tp_size == 1:
        return v
    if len(v.shape) == 1:
        return np.ascontiguousarray(np.split(v, tp_size)[idx].copy())
    else:
        return np.ascontiguousarray(np.split(v, tp_size, axis=dim)[idx].copy())

(i.e., add .copy()). In my test using an HF checkpoint, it alleviated the memory issue.

In particular, during weight splitting NumPy slices a tensor, and a sliced tensor shares the memory buffer of the original tensor. Returning the slice without copying keeps the buffer of the full loaded weight referenced, so the garbage collector can't release those weights until the whole process ends. There is a related discussion on Stack Overflow: https://stackoverflow.com/questions/50195197/reduce-memory-usage-when-slicing-numpy-arrays
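
A minimal illustration of the view-versus-copy behavior described above (nothing TRT-LLM specific, just NumPy):

import numpy as np

big = np.zeros((8, 1024, 1024), dtype=np.float16)   # stands in for a full weight tensor

view = np.split(big, 8)[0]           # a view: shares big's underlying buffer
copy = np.split(big, 8)[0].copy()    # an owned copy with its own, smaller buffer

print(view.base is not None)   # True  -> holding `view` keeps all of `big` alive
print(copy.base is None)       # True  -> `big` can be freed once no view references it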

atyshka commented 11 months ago

Thanks for the suggestion @jaedeok-nvidia. Can you share the details of your model and hardware setup? I was originally using Meta checkpoints, and unfortunately your copy() patch didn't seem to resolve the issue; it still runs out of memory when loading the weights the second time around. When I try to use Hugging Face checkpoints instead, I can't even finish building the first engine without running out of my 128 GB of CPU memory.

jaedeok-nvidia commented 11 months ago

I tested on a DGX A100 (which has a lot of host memory), so instead I measured the memory usage there. 128 GB may simply be too small to build an engine under the current workflow. I'm working on revising the engine-build workflow to reduce the host memory footprint.

In the meantime, can you please try it this way? Load the HF model onto the GPUs instead of the CPU; to do this, we may need to update the HF checkpoint loading like this:

        hf_llama = LlamaForCausalLM.from_pretrained(
            args.model_dir,
            device_map='auto',
            torch_dtype="auto")

There is currently still a memory leak in model loading, so this may help save some more host memory. Unfortunately, it seems the build can still exceed 128 GB of host memory. If it fails, please also try explicitly calling del engine after serialization (here).
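
A rough sketch of that suggestion at the end of the per-rank build step (build_rank_engine appears in the tracebacks in this thread; serialize_engine and the surrounding variable names are illustrative):

import gc
import os

engine = build_rank_engine(builder, builder_config, engine_name, cur_rank, args)
serialize_engine(engine, os.path.join(args.output_dir, engine_name))  # write to disk
del engine      # explicitly drop the reference, as suggested above
gc.collect()    # encourage the host memory to be reclaimed before the next rank is built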

npuichigo commented 11 months ago

Loading Baichuan-13B also consumes a lot of memory and gets killed.

lynkz-matt-psaltis commented 11 months ago

Also try updating transformers - I found 4.34.1 allowed the model to at least serialize successfully.

Ok, nope - I managed to get one engine serialized out, but every build after that is back to OOM.

I'm using main branch commit https://github.com/NVIDIA/TensorRT-LLM/commit/d8b408e6dcc1d45982a8b94399cd74b78f80befa, since it was noted in the discussion here to include improvements that help with this problem: https://github.com/NVIDIA/TensorRT-LLM/discussions/153

I'm on 1xA100 (80GB) using https://huggingface.co/Phind/Phind-CodeLlama-34B-v2:

I've tried various derivatives of the following configuration, as well as weight-only quantization.

python ./tensorrt_llm/examples/llama/build.py \
    --model_dir ./Phind/Phind-CodeLlama-34B-v2/ \
    --output_dir ./Phind/Phind-CodeLlama-34B-v2-engine/ \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --rotary_base 1000000 --vocab_size 32000 --world_size 1 --tp_size 1 \
    --enable_context_fmha \
    --use_parallel_embedding \
    --use_inflight_batching \
    --max_input_len 1024 \
    --max_output_len 1024 \
    --max_batch_size 8 \
    --parallel_build \
    --paged_kv_cache

Swapping from CPU to GPU for the model and heads as suggested above (with device_map='auto' as well as device_map={"model": "cuda","lm_head": "cuda"}) gives the following error:

[TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/vocab_embedding/CONSTANT_0...LLaMAForCausalLM/layers/0/input_layernorm/ELEMENTWISE_PROD_0]}.
[TRT] [E] 10: [optimizer.cpp::computeCosts::4051] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/vocab_embedding/CONSTANT_0...LLaMAForCausalLM/layers
/0/input_layernorm/ELEMENTWISE_PROD_0]}.)

Please let me know if there's any other information / examples I should try to help diagnose. Are the other patches from 4 days ago still relevant given the changes from the commit above?

jaedeok-nvidia commented 11 months ago

Hi @lynkz-matt-psaltis, I tried to reproduce this but it works well on my end. The error message says there is an unsupported op on the TRT side. Can you please share your testing environment, e.g. the TensorRT version? An older TRT had a similar issue.

We've tested on TRT 9.1.0.4. If you need to upgrade TRT, please use the installation script docker/common/install_tensorrt.sh.

lynkz-matt-psaltis commented 11 months ago

Thanks @jaedeok-nvidia

I'm building from main (commit 4de32a86ae92bc49a7ec17c00ec2f2d03663c198) on nvcr.io/nvidia/tritonserver:23.10-py3. pip list shows: tensorrt 9.1.0.post12.dev4

Was your repro environment a single A100 80G with ~200 GB of RAM?

Minami-su commented 11 months ago

I tested on a DGX A100 (which has a lot of host memory), so instead I measured the memory usage there. 128 GB may simply be too small to build an engine under the current workflow. I'm working on revising the engine-build workflow to reduce the host memory footprint.

In the meantime, can you please try it this way? Load the HF model onto the GPUs instead of the CPU; to do this, we may need to update the HF checkpoint loading like this:

        hf_llama = LlamaForCausalLM.from_pretrained(
            args.model_dir,
            device_map='auto',
            torch_dtype="auto")

There is currently still a memory leak in model loading, so this may help save some more host memory. Unfortunately, it seems the build can still exceed 128 GB of host memory. If it fails, please also try explicitly calling del engine after serialization (here).

[11/02/2023-12:40:16] [TRT-LLM] [I] HF Baichuan v2_13b loaded. Total time: 00:04:33
[11/02/2023-12:40:16] [TRT-LLM] [I] Loading weights from HF Baichuan v2_13b...
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/baichuan/build.py", line 478, in <module>
    build(0, args)
  File "/app/tensorrt_llm/examples/baichuan/build.py", line 448, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/app/tensorrt_llm/examples/baichuan/build.py", line 348, in build_rank_engine
    load_from_hf_baichuan(tensorrt_llm_baichuan,
  File "/app/tensorrt_llm/examples/baichuan/weight.py", line 63, in load_from_hf_baichuan
    v = torch_to_numpy(v.to(torch_dtype).detach().cpu())
NotImplementedError: Cannot copy out of meta tensor; no data!

https://github.com/NVIDIA/TensorRT-LLM/issues/229#issuecomment-1791008168

atyshka commented 11 months ago

@jaedeok-nvidia if I add the del, it still consumes more than 128 GB of memory, but I can manage to make it work by putting 10 GB in a swapfile.

I'm going to keep this open, though, because there's still quite a way to go on memory efficiency. For my 34B model, the weights should take 68 GB in fp16 format. Obviously the actual engine-building process is going to use significant memory, but I wouldn't expect it to be double the model size. Additionally, you should only need 1/n of the weights at a time for a configuration of n tensor-parallel engines, so each of my two engines should in theory only need 34 GB of weights.
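
The arithmetic behind those numbers, for reference:

params = 34e9                                  # CodeLlama-34B parameter count (approximate)
fp16_bytes = 2                                 # bytes per parameter in float16
full_weights_gb = params * fp16_bytes / 1e9    # ~68 GB for the full model
per_rank_gb = full_weights_gb / 2              # ~34 GB per rank with tp_size=2
print(full_weights_gb, per_rank_gb)            # 68.0 34.0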

jaedeok-nvidia commented 11 months ago

Hi @atyshka, the issue is due to a memory leak in the TRT-LLM model. Specifically, the TRT-LLM model is not properly released after the build_rank_engine() method. We have almost identified it and I believe we will fix it soon.

Additionally, even with tp_size > 1, we currently have to load the full weights in order to split them from a pretrained checkpoint. However, I agree, and we will optimize the memory footprint further. For instance, we can reduce it by loading the checkpoint shard-by-shard instead of loading the model fully, as the Falcon example does.
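
For illustration, shard-by-shard loading over a sharded HF checkpoint could look roughly like this (a sketch, not the actual TRT-LLM implementation; it assumes the standard HF pytorch_model.bin.index.json layout):

import json
import os
import torch

def iter_checkpoint_tensors(model_dir):
    # Walk the shards listed in the HF index file one at a time, so only a
    # single shard's worth of weights is resident in host memory at once.
    with open(os.path.join(model_dir, "pytorch_model.bin.index.json")) as f:
        index = json.load(f)
    for shard in sorted(set(index["weight_map"].values())):
        state_dict = torch.load(os.path.join(model_dir, shard), map_location="cpu")
        for name, tensor in state_dict.items():
            yield name, tensor   # the caller keeps only the slice its rank needs
        del state_dict           # free the shard before loading the next one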

jaedeok-nvidia commented 11 months ago

Hi @lynkz-matt-psaltis, sorry for missing your reply. Hmm... it's odd; that error is not about memory usage. If you still have the issue, can you please create another issue so we can discuss it there?

vicwer commented 11 months ago

Loading a GPT-2 13B model also consumes a lot of memory and gets killed. Script:

python3 build.py --model_dir=./c-model-13b/gpt2/1-gpu --use_gpt_attention_plugin=float16 --dtype=float16 --use_gemm_plugin=float16  --remove_input_padding

hardware:

               total        used        free      shared  buff/cache   available
Mem:           503Gi        68Gi        73Gi       1.9Gi       361Gi       412Gi
Low:           503Gi       429Gi        73Gi
---
GPU-v100 32510MiB

error log:

[11/07/2023-08:45:41] [TRT] [W] Requested amount of GPU memory (33555480576 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/07/2023-08:45:41] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 33555480576 detected for tactic 0x000000000000001e.
[11/07/2023-08:45:41] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[11/07/2023-08:45:41] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[11/07/2023-08:45:41] [TRT] [W] Requested amount of GPU memory (33555480576 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/07/2023-08:45:41] [TRT] [W] UNSUPPORTED_STATESkipping tactic 3 due to insufficient memory on requested size of 33555480576 detected for tactic 0x000000000000001f.
[11/07/2023-08:45:43] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[11/07/2023-08:45:43] [TRT] [I] Detected 49 inputs and 41 output network tensors.
[11/07/2023-08:46:10] [TRT] [I] Total Host Persistent Memory: 71248
[11/07/2023-08:46:10] [TRT] [I] Total Device Persistent Memory: 0
[11/07/2023-08:46:10] [TRT] [I] Total Scratch Memory: 4608992384
[11/07/2023-08:46:10] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 650 steps to complete.
[11/07/2023-08:46:10] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 35.8922ms to assign 11 blocks to 650 nodes requiring 8279011840 bytes.
[11/07/2023-08:46:10] [TRT] [I] Total Activation Memory: 8279011840
[11/07/2023-08:46:11] [TRT] [I] Total Weights Memory: 26245963592
[11/07/2023-08:46:11] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 26432, GPU 26871 (MiB)
[11/07/2023-08:46:11] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 26433, GPU 26881 (MiB)
[11/07/2023-08:46:11] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1065 MiB, GPU 25031 MiB
[11/07/2023-08:46:11] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +25031, now: CPU 0, GPU 25031 (MiB)
[11/07/2023-08:46:41] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 51556 MiB
[11/07/2023-08:46:41] [TRT-LLM] [I] Total time of building gpt_float16_tp1_rank0.engine: 00:01:47
[11/07/2023-08:46:41] [TRT-LLM] [I] Config saved to gpt_outputs/config.json.
[11/07/2023-08:46:41] [TRT-LLM] [I] Serializing engine to gpt_outputs/gpt_float16_tp1_rank0.engine...
Killed

jfolz commented 11 months ago

I'd also like to chime in. Trying to build Llama2 70B engines in fp16 with tp4 and less than 500 GB of host memory goes OOM before rank 3 can finish. Ignore the red and yellow graphs below; only the green (user) memory is relevant.

[image: host memory usage during the fp16 tp4 build]

Here's the command I use on this RTX A6000 system:

python build.py --model_dir "$INDIR/hf-70B-chat/" \
                --output_dir "$ENGINE_DIR" \
                --remove_input_padding \
                --max_input_len 4096 \
                --max_output_len 4096 \
                --max_batch_size 16 \
                --dtype float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_inflight_batching \
                --paged_kv_cache \
                --world_size 4 \
                --tp_size 4 \
                --pp_size 1

The issue is less pronounced when using int8: [image: host memory usage during the int8 build]

And here's what the process looks like on a DGX A100: [image: host memory usage on DGX A100]

More memory, so it doesn't crash, but using this amount to build is just silly.

jaedeok-nvidia commented 10 months ago

Hi all, sorry for the inconvenience during engine builds.

Last week we updated the main branch to reduce the peak CPU memory footprint. Please use the --load_by_shard option for the LLaMA / BLOOM / Falcon models to reduce the memory footprint. For the LLaMA model, --load_by_shard works with HF checkpoints only, so please convert to HF first if you have a Meta checkpoint (please refer to the guide for the checkpoint conversion).

There were several root causes for this issue: a memory leak at engine-build time, and loading the full HF model. We have fixed the memory leaks at engine-build time and now allow loading the weights shard-by-shard (instead of the full model), so the memory footprint during engine build is significantly reduced. Thanks for your valuable feedback.
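
For example, on a LLaMA HF checkpoint the flag would be added to the build command like so (the model path is illustrative; the other flags mirror the commands earlier in this thread):

python build.py --model_dir ./CodeLlama-34b-Instruct-hf/ \
    --dtype float16 --remove_input_padding \
    --use_gpt_attention_plugin float16 --use_gemm_plugin float16 \
    --enable_context_fmha \
    --world_size 2 --tp_size 2 \
    --load_by_shard \
    --output_dir codellama_34b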

Linzecong commented 9 months ago

load_from_binary

There is an error when using --use_weight_only. My version is 0.6.1.

My build command is:

python /app/tensorrt_llm/examples/llama/build.py --model_dir /app/hf_model --dtype float16 --remove_input_padding  --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --tp_size 2 --enable_context_fmha  --use_inflight_batching  --paged_kv_cache  --load_by_shard  --use_weight_only  --weight_only_precision int8 --output_dir /app/triton_model/tensorrt_llm/1

The error is:

Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/build.py", line 839, in <module>
    build(0, args)
  File "/app/tensorrt_llm/examples/llama/build.py", line 783, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/app/tensorrt_llm/examples/llama/build.py", line 631, in build_rank_engine
    load_from_hf_checkpoint(tensorrt_llm_llama,
  File "/app/tensorrt_llm/examples/llama/weight.py", line 522, in load_from_hf_checkpoint
    param = split_v.transpose()
TypeError: transpose() received an invalid combination of arguments - got (), but expected one of:
 * (int dim0, int dim1)
 * (name dim0, name dim1)

Linzecong commented 9 months ago

load_from_binary

There is an error when using --use_weight_only. My version is 0.6.1.

My build command is:

python /app/tensorrt_llm/examples/llama/build.py --model_dir /app/hf_model --dtype float16 --remove_input_padding  --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --tp_size 2 --enable_context_fmha  --use_inflight_batching  --paged_kv_cache  --load_by_shard  --use_weight_only  --weight_only_precision int8 --output_dir /app/triton_model/tensorrt_llm/1

The error is:

Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/build.py", line 839, in <module>
    build(0, args)
  File "/app/tensorrt_llm/examples/llama/build.py", line 783, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/app/tensorrt_llm/examples/llama/build.py", line 631, in build_rank_engine
    load_from_hf_checkpoint(tensorrt_llm_llama,
  File "/app/tensorrt_llm/examples/llama/weight.py", line 522, in load_from_hf_checkpoint
    param = split_v.transpose()
TypeError: transpose() received an invalid combination of arguments - got (), but expected one of:
 * (int dim0, int dim1)
 * (name dim0, name dim1)

Solved by changing split_v.transpose() to

split_v.transpose(0,1).contiguous()
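
For context, that error comes from the API difference between NumPy and PyTorch: numpy.ndarray.transpose() accepts no arguments (it reverses all axes), while torch.Tensor.transpose() requires two dimension indices, which suggests split_v is a torch tensor on this code path. A minimal illustration:

import numpy as np
import torch

a = np.ones((2, 3))
a.transpose()                     # OK: NumPy reverses the axes -> shape (3, 2)

t = torch.ones(2, 3)
# t.transpose()                   # TypeError, as in the traceback above
t.transpose(0, 1).contiguous()    # OK: swap dims 0 and 1 and make the result contiguous
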
Burning-XX commented 9 months ago

branch

But what can we do on the release 0.5.0 branch? I still hit this problem with branch 0.5.0 and 32 GB of CPU memory.

liyunhan commented 6 months ago

Hi all, sorry for the inconvenience during engine builds.

Last week we updated the main branch to reduce the peak CPU memory footprint. Please use the --load_by_shard option for the LLaMA / BLOOM / Falcon models to reduce the memory footprint. For the LLaMA model, --load_by_shard works with HF checkpoints only, so please convert to HF first if you have a Meta checkpoint (please refer to the guide for the checkpoint conversion).

There were several root causes for this issue: a memory leak at engine-build time, and loading the full HF model. We have fixed the memory leaks at engine-build time and now allow loading the weights shard-by-shard (instead of the full model), so the memory footprint during engine build is significantly reduced. Thanks for your valuable feedback.

Does the --load_by_shard option also support the Qwen-72B model on TRT-LLM v0.7.0? I have two A6000s, and the single-node host memory is 62 GB.