NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mixtral conversion OOM Fix #1533

Open hawkeoni opened 4 months ago

hawkeoni commented 4 months ago

System Info

Who can help?

@byshiue

Information

Tasks

Reproduction

tl;dr - Mixtral quantization fails with OOM on 2x H100 80 GB GPUs, and I propose a small fix for it in nvidia-ammo.

Hi! I've been trying to convert Mixtral to FP8 using the latest version of TensorRT-LLM, but I hit an OOM error:

Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to tllm_checkpoint_mixtral_2gpu/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.11 GiB of which 166.62 MiB is free. Process 970747 has 78.93 GiB memory in use. Of the allocated memory 78.15 GiB is allocated by PyTorch, and 132.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 332, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 204, in torch_to_tensorrt_llm_checkpoint
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 1149, in build_decoder_config
    config.mlp = build_moe_config(layer, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 978, in build_moe_config
    experts.fc, experts.proj = build_stacked_experts(module.experts, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 892, in build_stacked_experts
    experts_weight_1.weight = torch.concat(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.11 GiB of which 166.62 MiB is free. Process 970747 has 78.93 GiB memory in use. Of the allocated memory 78.15 GiB is allocated by PyTorch, and 132.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/quantization/quantize.py", line 52, in <module>
    quantize_and_export(model_dir=args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 335, in quantize_and_export
    with open(f"{export_path}/config.json", "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: './tllm_checkpoint_mixtral_2gpu/config.json'

I'm launching the conversion using the official script from here:

# Quantize HF Mixtral into FP8 and export trtllm checkpoint
python ../quantization/quantize.py --model_dir ./Mixtral-8x7B-v0.1 \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_mixtral_2gpu \
                                   --calib_size 512 \
                                   --tp_size 2

# Build trtllm engines from the trtllm checkpoint
# Enable fp8 context fmha to get further acceleration by setting `--use_fp8_context_fmha enable`
trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_2gpu \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --strongly_typed \
             --workers 2

I used 2 H100 GPUs with roughly 80 GB of memory each; during the conversion the first GPU had 80 GB allocated and the second had 40 GB, and the conversion failed.

Expected behavior

Successful model conversion

Actual behavior

OOM Failure

Additional notes

I've managed to fix it by going deep into nvidia-ammo. My version is:

Name: nvidia-ammo
Version: 0.9.3

I found that in the file ammo/torch/export/layer_utils.py, in the function _build_stacked_linear, the tensor concatenation results in OOM, so I fixed it by moving the tensors to the CPU:

def _build_stacked_linear(experts: nn.Module, module_name, linear_type, dtype):
    config = LinearConfig(linear_type=linear_type)

    first_module = getattr(experts[0], module_name)
    # weights
    config.weight = torch.stack(
        [getattr(e, module_name).weight.detach().type(dtype).cpu() for e in experts]  # <-- added `.cpu()` here
    )
    ...
This fixed the problem, and both GPUs topped out at around 46 GB used.

I do not have access to the nvidia-ammo codebase, so hopefully this helps anyone running into this issue. Maybe some sort of .cpu() fix can be merged into nvidia-ammo for the next release?
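For anyone who wants to see the idea in isolation, here is a minimal, self-contained sketch of the same trick (toy module names, not the actual ammo internals): moving each expert's weight to the CPU before stacking means the large combined tensor is never materialized in GPU memory during export.

import torch
import torch.nn as nn

def stack_expert_weights_on_cpu(experts, module_name, dtype=torch.float16):
    # Detach each expert's weight, cast it, and move it to the CPU *before*
    # stacking, so only per-expert tensors ever live on the GPU.
    return torch.stack(
        [getattr(e, module_name).weight.detach().type(dtype).cpu() for e in experts]
    )

# Toy MoE block: 8 experts, each holding an "fc" linear layer (hypothetical names).
experts = nn.ModuleList(
    [nn.ModuleDict({"fc": nn.Linear(16, 32)}) for _ in range(8)]
)
stacked = stack_expert_weights_on_cpu(experts, "fc")
print(stacked.shape, stacked.device)  # torch.Size([8, 32, 16]) cpu

The trade-off is extra host/device copies during export, but peak GPU memory drops by roughly the size of the stacked expert weights, which is what made the difference here.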

MrD005 commented 4 months ago

I am doing the same FP8 quantization, but on a Llama-2 34B model using 4x H100, and I am facing the same issue as well.

Slyne commented 4 months ago

Saw a similar issue with an A10G + Llama 3 8B. Also solved it with a similar trick: manually editing the source code in ammo to move tensors from GPU to CPU.

nv-guomingz commented 3 months ago

trt-llm will add the --device knob in a coming release; then you can specify --device cpu to avoid such OOM issues.
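Assuming the knob works as described (the exact flag name and semantics may differ once released), the quantization step from the reproduction above would then look something like:

# Assumed usage of the upcoming --device knob; exact syntax may differ in the release.
python ../quantization/quantize.py --model_dir ./Mixtral-8x7B-v0.1 \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_mixtral_2gpu \
                                   --calib_size 512 \
                                   --tp_size 2 \
                                   --device cpu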