NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

OOM when using quantize.py to quantize llama-like model #1285

Open andakai opened 4 months ago

andakai commented 4 months ago

Who can help?

@Tracin

Reproduction

I want to quantize AquilaChat2-34B, whose architecture is Llama-like, using the same commands as for Llama (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama). Specifically, I want to apply int8 KV cache + AWQ quantization.

#!/bin/bash
python ../quantization/quantize.py --model_dir /tmp/AquilaChat2-34B/ \
                                   --output_dir ./tllm_aquila_checkpoint_1gpu_awq_int8_kv_cache \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --kv_cache_dtype int8 \
                                   --calib_size 1

Expected behavior

The model is quantized using the int8 KV cache + AWQ method.

Actual behavior

I tried running the command both inside and outside the container, but both runs hit "CUDA OOM". The full log is:

[NeMo W 2024-03-12 13:29:16 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
      warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

Initializing model from /tmp/AquilaChat2-34B/
Loading checkpoint shards:   0%|  | 0/7 [00:00<?, ?it/s][NeMo W 2024-03-12 13:29:19 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/torch/_utils.py:836: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
      return self.fget.__get__(instance, owner)()

Loading checkpoint shards: 100%|█| 7/7 [00:14<00:00,  2.
Initializing tokenizer from /tmp/AquilaChat2-34B/

AWQ calibration could take longer than other calibration methods. Please increase the batch size to speed up the calibration process. Batch size can be set by adding the argument --batch_size <batch_size> to the command line.

Loading calibration dataset
{'quant_cfg': {'*weight_quantizer': {'num_bits': 4, 'block_sizes': {-1: 128}, 'enable': True}, '*input_quantizer': {'enable': False}, '*lm_head*': {'enable': False}, '*output_layer*': {'enable': False}, 'default': {'enable': False}, '*.query_key_value.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.Wqkv.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.W_pack.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.c_attn.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.k_proj.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.v_proj.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}}, 'algorithm': {'method': 'awq_lite', 'alpha_step': 0.1}}
Starting quantization...
Replaced 1263 modules to quantized modules
Caching activation statistics for awq_lite...
Calibrating batch 0
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
Searching awq_lite parameters...
Calibrating batch 0
[NeMo W 2024-03-12 13:29:40 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      self.register_buffer("_pre_quant_scale", torch.tensor(value))

[NeMo W 2024-03-12 13:29:40 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      value = torch.tensor(value, device=self._pre_quant_scale.device)

Calibrating batch 0
Quantization done. Total time used: 16.35 s.
Unknown model type AquilaForCausalLM. Continue exporting...
Warning: export_npz is going to be deprecated soon and replaced by safetensors.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Padding vocab_embedding and lm_head for AWQ weights export
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
current rank: 0, tp rank: 0, pp rank: 0
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to tllm_aquila_checkpoint_1gpu_awq_int8_kv_cache/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 98.81 MiB is free. Process 1762589 has 39.28 GiB memory in use. Of the allocated memory 38.72 GiB is allocated by PyTorch, and 69.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 316, in export_model_config
    model_config_dict = model_config_to_dict(model_config)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 49, in model_config_to_dict
    return dataclasses.asdict(model_config)
  File "/usr/lib/python3.10/dataclasses.py", line 1238, in asdict
    return _asdict_inner(obj, dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1273, in _asdict_inner
    return type(obj)(_asdict_inner(v, dict_factory) for v in obj)
  File "/usr/lib/python3.10/dataclasses.py", line 1273, in <genexpr>
    return type(obj)(_asdict_inner(v, dict_factory) for v in obj)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1279, in _asdict_inner
    return copy.deepcopy(obj)
  File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 122, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 839, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
  File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 112, in __deepcopy__
    new_storage = self.clone()
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 126, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 98.81 MiB is free. Process 1762589 has 39.28 GiB memory in use. Of the allocated memory 38.72 GiB is allocated by PyTorch, and 69.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Quantized model exported to ./tllm_aquila_checkpoint_1gpu_awq_int8_kv_cache 
Total time used 419.97 s

additional notes

I also tried it on 4xA100-40G; it still hits CUDA OOM.
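
Reading the traceback, the export fails inside dataclasses.asdict: it routes every tensor field through copy.deepcopy, and torch's __deepcopy__ clones each CUDA tensor's storage on the GPU while the full calibrated model is still resident, which is the extra allocation that fails on a nearly full card. A minimal standalone sketch of the mechanism (the dataclass and tensor size are hypothetical, just to illustrate; whether quantize.py can be told to offload first is an assumption I have not verified):

import dataclasses

import torch

@dataclasses.dataclass
class FakeLayerConfig:
    # Stand-in for the per-layer weights AMMO keeps in its model config.
    weight: torch.Tensor

cfg = FakeLayerConfig(
    weight=torch.empty(4096, 4096, dtype=torch.float16, device="cuda")
)

# asdict() deep-copies non-dataclass fields (see the traceback above), so
# each CUDA tensor is cloned on the GPU, doubling its footprint.
d = dataclasses.asdict(cfg)

# Possible mitigation: move tensors to host memory first, so the deep
# copy happens on the CPU instead of the already-full GPU.
cfg.weight = cfg.weight.cpu()
d = dataclasses.asdict(cfg)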

andakai commented 4 months ago

I built a new image with the latest version from https://github.com/NVIDIA/TensorRT-LLM/pull/1274, following the doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/build_from_source.md#option-1-build-tensorrt-llm-in-one-step, and found that quantize.py has changed. However, when I run the quantization on 2xA100-40G to quantize AquilaChat2-34B, an OOM still occurs.

python ../quantization/quantize.py --model_dir /tmp/AquilaChat2-34B \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir ./quantized_int4-awq \
                                   --calib_size 1
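
Since this run uses 2xA100-40G, one thing worth checking (an assumption on my part; I have not traced how the updated quantize.py loads the checkpoint) is whether the model is actually sharded across both GPUs rather than loaded onto one. A standalone sketch with transformers/accelerate:

import torch
from transformers import AutoModelForCausalLM

# Hypothetical check, independent of quantize.py: shard the checkpoint
# across all visible GPUs and inspect where the layers landed.
model = AutoModelForCausalLM.from_pretrained(
    "/tmp/AquilaChat2-34B",
    torch_dtype=torch.float16,
    device_map="auto",        # let accelerate place layers across GPUs
    trust_remote_code=True,   # Aquila may ship custom modeling code
)
print(model.hf_device_map)    # mapping of modules to devices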

Does the 34B model really need this much memory?
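
By rough arithmetic (assuming ~34e9 parameters; the exact count may differ), the float16 weights alone exceed a single A100-40G:

params = 34e9                # assumed parameter count for AquilaChat2-34B
gib = 2**30
print(f"fp16 weights: {params * 2 / gib:.0f} GiB")        # ~63 GiB
print(f"int4-AWQ weights: {params * 0.5 / gib:.0f} GiB")  # ~16 GiB (4 bits/param, plus scales)

So calibration has to hold roughly 63 GiB of weights plus activations somewhere; 2x40G would only suffice if the load is actually sharded across both cards.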

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.