NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mixtral engine build gives CUDA OOM on 8 40GB GPUs (0.8.0 release) #1304

Closed: vnkc1 closed this issue 4 months ago

vnkc1 commented 6 months ago

System Info

AWS p4d instance: 8x NVIDIA A100 40GB GPUs

Package: tensorrt-9.2.0.post12.dev5-cp310-none-linux_x86_64.whl; [TensorRT-LLM] TensorRT-LLM version: 0.8.0

Who can help?

@byshiue

Reproduction

  1. Install TensorRT-LLM:
     python -m pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com

  2. Convert the checkpoint:
     python ./examples/llama/convert_checkpoint.py --model_dir ~/Mixtral-8x7B-Instruct-v0.1 --output_dir ~/checkpoints/Mixtral-8x7B-Instruct-v0.1/bf16-tp8 --dtype bfloat16 --tp_size 8 --workers 8

Expected behavior

Successful checkpoint creation

Actual behavior

CUDA Out-Of-Memory on 1 out of 8 GPUs

Additional notes

I believe a 56B-parameter model should build comfortably on 8x 40GB GPUs; could I get some information on why this is occurring and on how to estimate the GPU memory required to build a Mixtral engine?
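
For a rough sense of scale, here is a back-of-the-envelope sketch of the weight memory involved (this is not TensorRT-LLM's own accounting; the dimensions are the published Mixtral-8x7B config values, and the formula only counts weights, not conversion temporaries or activations):

# Rough estimate of Mixtral-style MoE weight memory; an approximation only.
def moe_weight_gib(hidden=4096, inter=14336, n_layers=32, n_experts=8,
                   vocab=32000, n_kv_heads=8, head_dim=128, bytes_per_param=2):
    attn = 2 * hidden * hidden + 2 * hidden * n_kv_heads * head_dim  # q/o + k/v (GQA)
    experts = n_experts * 3 * hidden * inter                         # w1, w2, w3 per expert
    router = n_experts * hidden
    embed = 2 * vocab * hidden                                       # embedding + lm_head
    params = n_layers * (attn + experts + router) + embed
    return params * bytes_per_param / 1024**3

total = moe_weight_gib()
print(f"full bf16 weights : ~{total:.0f} GiB")      # roughly 87 GiB
print(f"per rank at tp=8  : ~{total / 8:.0f} GiB")  # roughly 11 GiB

So the converted tp=8 shards themselves are small; the pressure during conversion appears to come from holding the unconverted HF weights on the GPUs plus the fused-weight temporaries, as discussed further down the thread.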

vnkc1 commented 6 months ago

OOM occurs at: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/examples/llama/convert_checkpoint.py#L960

The failing call is torch.concat([w3, w1], dim=-2), which raises:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 1 has a total capacity of 39.39 GiB of which 98.38 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 38.75 GiB is allocated by PyTorch, and 61.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
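
For illustration, a minimal sketch of the failing pattern and one way to keep the fused tensor off the GPU (the shapes follow Mixtral-8x7B's expert FFN and reproduce the 224 MiB allocation from the error above; this is not the actual convert_checkpoint.py code):

import torch

# One expert's gate/up projections in bf16 (hidden=4096, inter=14336); illustrative shapes.
w1 = torch.empty(14336, 4096, dtype=torch.bfloat16, device="cuda")
w3 = torch.empty(14336, 4096, dtype=torch.bfloat16, device="cuda")

# Failing pattern: the concat materializes a new 224 MiB fused tensor on a GPU
# that is already nearly full of HF weights.
# fused = torch.concat([w3, w1], dim=-2)

# Workaround sketch: move the operands to host memory first so the fused
# tensor lands in CPU RAM instead of on the device.
fused = torch.concat([w3.cpu(), w1.cpu()], dim=-2)
print(fused.shape, fused.device)  # torch.Size([28672, 4096]) cpu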

nivibilla commented 6 months ago

see #1156

vnkc1 commented 4 months ago

@mickaelseznec, here's the reproduction on 8xA100 (40GB) using the 0.9.0 release

$ python examples/llama/convert_checkpoint.py --model_dir ./Mixtral-8x22B-Instruct-v0.1 --output_dir ./ckpt --dtype float16 --tp_size 8

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Loading checkpoint shards: 100%|██████████| 59/59 [12:56<00:00, 13.16s/it]
Traceback (most recent call last):
  [...] in <module>
    main()
  [...] in main
    convert_and_save_hf(args)
  [...] in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  [...] in execute
    f(args, rank)
  [...] in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  [...] in from_hugging_face
    llama = convert.from_hugging_face(
  [...] in from_hugging_face
    weights = load_weights_from_hf(config=config,
  [...] in load_weights_from_hf
    weights = convert_hf_llama(
  [...] in convert_hf_llama
    convert_layer(l)
  [...] in convert_layer
    f'model.layers.{l}.block_sparse_moe.experts.w3w1.weight'] = torch.concat(

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 134.81 MiB is free. Process 33539 has 0 bytes memory in use. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 38.31 GiB is allocated by PyTorch, and 61.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expand
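
Plugging 8x22B-sized dimensions into the sketch earlier in the thread (the config values below are assumptions taken from the published model card, not from this issue) suggests why this model is even tighter:

# Assumed Mixtral-8x22B config: hidden=6144, inter=16384, 56 layers, 8 experts, vocab=32768.
total = moe_weight_gib(hidden=6144, inter=16384, n_layers=56,
                       n_experts=8, vocab=32768)
print(f"full fp16 weights : ~{total:.0f} GiB")      # roughly 262 GiB
print(f"per rank at tp=8  : ~{total / 8:.0f} GiB")  # roughly 33 GiB

Even the final per-rank shards are close to the 40 GB limit, before any conversion temporaries.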

djns99 commented 4 months ago

@vnkc1 can you try the workaround of switching the device map from auto to cpu, as suggested in https://github.com/NVIDIA/TensorRT-LLM/issues/1440?
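
For anyone trying that, a sketch of the kind of change being suggested, assuming the model is loaded through transformers.AutoModelForCausalLM (the exact loading code inside convert_checkpoint.py may differ):

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" spreads the raw HF weights across every visible GPU, leaving
# little headroom for conversion temporaries; "cpu" keeps them in host RAM instead.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # or a local --model_dir path
    torch_dtype=torch.bfloat16,
    device_map="cpu",                        # workaround from #1440, instead of "auto"
    low_cpu_mem_usage=True,
)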

vnkc1 commented 4 months ago

@djns99 I cannot load the model onto the CPU, as I will be running quantization calibration.

djns99 commented 4 months ago

I'm not sure I understand how that prevents you from loading on the CPU. If you are quantizing to FP8 (Hopper only), you should be using quantize.py. If you are quantizing to int8, only symmetric weight-only quantization is currently supported, and that quantization runs on the CPU (and does not require calibration).