NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to control out of memory error with PYTORCH_CUDA_ALLOC_CONF? #1964

Closed: mahmoodn closed this issue 1 month ago

mahmoodn commented 1 month ago

I am using quantize.py according to the GPTJ inference guide, and the command is

python examples/quantization/quantize.py \
    --dtype=float16  \
    --output_dir=./model/GPTJ-6B/fp8-quantized-ammo/GPTJ-FP8-quantized \
    --model_dir=./model/GPTJ-6B/checkpoint-final/ \
    --qformat=fp8 --kv_cache_dtype=fp8

However, that command fails with an out-of-memory error:

Calibrating batch 510
Calibrating batch 511
Quantization done. Total time used: 1068.72 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to model/GPTJ-6B/fp8-quantized-ammo/GPTJ-FP8-quantized/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 9.77 GiB of which 36.06 MiB is free. Process 230131 has 9.56 GiB memory in use. Of the allocated memory 9.29 GiB is allocated by PyTorch, and 18.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/layer_utils.py", line 944, in build_decoder_config
    config.mlp = build_mlp_config(layer, decoder_type, dtype)
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/layer_utils.py", line 764, in build_mlp_config
    config.fc = build_linear_config(layer, LINEAR_COLUMN, dtype)
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/layer_utils.py", line 591, in build_linear_config
    weight = torch_weight.type(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 9.77 GiB of which 36.06 MiB is free. Process 230131 has 9.56 GiB memory in use. Of the allocated memory 9.29 GiB is allocated by PyTorch, and 18.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Quantized model exported to ./model/GPTJ-6B/fp8-quantized-ammo/GPTJ-FP8-quantized 

I searched for PYTORCH_CUDA_ALLOC_CONF to see how to use it. I tried different values, and even with 32 (the minimum is 20), set via export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32' before running the Python command, I still get the same error.
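For reference, a minimal sketch of how I apply the setting (assuming the variable has to be in the environment before the first CUDA allocation; max_split_size_mb only mitigates fragmentation and cannot free memory that is genuinely in use):

    # sketch: set the allocator config before torch touches the GPU
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

    import torch  # imported after the variable is set so it is picked up

    # sanity check: how much device memory is actually free vs. total
    free, total = torch.cuda.mem_get_info(0)
    print(f"free: {free / 2**30:.2f} GiB, total: {total / 2**30:.2f} GiB")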

I have one RTX 3080 with 10GB of memory. Any idea how to fix this? I don't know whether it is a TensorRT-LLM issue or a PyTorch issue, so any idea would greatly help.

QiJune commented 1 month ago

Hi @mahmoodn , 10GB of device memory is not enough to quantize the GPT-J model. Please refer to the answer to a similar issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1932#issuecomment-2227560712
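A rough back-of-the-envelope estimate (assuming about 6B parameters for GPT-J, 2 bytes per FP16 weight, and that most weights are resident on the GPU when the export step casts them, as the traceback suggests) shows why 10GB is not enough:

    # rough estimate, not a measurement: GPT-J 6B weights in FP16
    n_params = 6.05e9        # approximate parameter count of GPT-J
    bytes_per_weight = 2     # float16
    weights_gib = n_params * bytes_per_weight / 2**30
    print(f"FP16 weights alone: ~{weights_gib:.1f} GiB")  # ~11.3 GiB
    print("GPU 0 capacity reported in the error: 9.77 GiB")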

mahmoodn commented 1 month ago

Thanks for the reply. Unfortunately, I don't have access to an A100 (Ampere). If there is no option to process the model in smaller chunks (if that is even possible) so that less GPU memory is needed at any given time, then that is bad...
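To illustrate what I mean by smaller chunks, here is a purely generic PyTorch sketch of converting weights one tensor at a time on the CPU (I am not saying AMMO or TensorRT-LLM supports this, it is just the kind of behavior I was hoping for):

    import torch

    def cast_state_dict_on_cpu(state_dict, dtype=torch.float16):
        # generic illustration: convert one tensor at a time on the CPU,
        # so the GPU never has to hold the whole model during export
        return {name: t.detach().to("cpu").to(dtype) for name, t in state_dict.items()}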

mahmoodn commented 1 month ago

Alternatively, I would like to ask whether quantized files are publicly available for those who don't have the computing resources?

QiJune commented 1 month ago

Hi @mahmoodn , we do have plans to upload pre-quantized weights to the HF model hub in the future.