casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Reduce the amount of GPU memory used in the quantization process #482

Open wx971025 opened 2 months ago

wx971025 commented 2 months ago

I used the example script in the README to quantize Llama 3 8B:

quant_config = { "zero_point": True, "q_group_size": 16, "w_bit": 4, "version": "GEMM" }
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True}, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)

I used 6x A800, but it still raises torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 GiB. GPU 0 has a total capacity of 79.33 GiB of which 76.77 GiB is free. How should I set the quantization parameters to reduce the GPU memory consumed during quantization?

casper-hansen commented 2 months ago

I quantized llama 3 70B on 3x A6000 48GB. Did you adjust the calibration dataset?
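If you want to use a smaller calibration set, you can pass it to quantize yourself. A rough sketch, assuming the calib_data keyword in your installed AutoAWQ version accepts a list of strings and that the paths below are placeholders (compare with examples/quantize.py):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/llama3-8b"  # placeholder path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# A small, explicit calibration set instead of the default 128 pileval samples.
# Fewer and shorter samples mean smaller activations during the calibration forward pass.
calib_data = [
    "Quantization reduces model weights to a lower precision to save memory.",
    "The quick brown fox jumps over the lazy dog.",
    # ... more short texts
]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)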

wx971025 commented 2 months ago

> I quantized llama 3 70B on 3x A6000 48GB. Did you adjust the calibration dataset?

No, I did not change the calibration dataset

casper-hansen commented 2 months ago

Ahh I see the issue. This is a transformers issue where they have a memory leak in their cache.

If you look at examples/quantize.py, we pass use_cache=False, which fixes this.
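Concretely, that is just the load call (a minimal sketch, argument names as in your snippet above):

model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False  # no KV cache during calibration
)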

wx971025 commented 2 months ago

> Ahh I see the issue. This is a transformers issue where they have a memory leak in their cache.
>
> If you look at examples/quantize.py, we pass use_cache=False, which fixes this.

This doesn't seem to work. It still causes an OOM:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 GiB. GPU 0 has a total capacity of 79.33 GiB of which 3.88 GiB is free. Process 32036 has 29.30 GiB memory in use. Process 107088 has 944.00 MiB memory in use. Process 107074 has 944.00 MiB memory in use. Process 107092 has 972.00 MiB memory in use. Process 107084 has 946.00 MiB memory in use. Process 107099 has 950.00 MiB memory in use. Process 114985 has 1.49 GiB memory in use. Process 114975 has 1.49 GiB memory in use. Process 115051 has 1.49 GiB memory in use. Process 115031 has 1.49 GiB memory in use. Process 115003 has 1.49 GiB memory in use. Process 107101 has 31.84 GiB memory in use. Including non-PyTorch memory, this process has 2.12 GiB memory in use. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 9.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

casper-hansen commented 2 months ago

One last thing I noticed about your code that can cause OOM: you use device_map='auto', which makes accelerate fill all GPUs with the model. It's better to set this to None and keep as much of the model as possible in CPU RAM; AutoAWQ will then move the model to the GPU layer by layer and quantize it.
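Roughly like this (a sketch based on your snippet above, not a full drop-in script):

# Keep the full-precision weights in CPU RAM; AutoAWQ moves one decoder layer
# at a time to the GPU while quantizing, so peak VRAM stays low.
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map=None,
)
model.quantize(tokenizer, quant_config=quant_config)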

casper-hansen commented 2 months ago

By the way, this is a known issue. AWQ batches 128 samples through the forward pass of the model at the same time. A fix is being worked on where we split the samples into chunks for a more VRAM-friendly experience.
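The idea is roughly the following (not the actual AutoAWQ code, just a sketch of the chunking approach with a hypothetical chunk_size):

import torch

def forward_in_chunks(model, samples, chunk_size=16):
    # Run the calibration forward pass in small slices instead of pushing
    # all 128 samples through at once, so their activations never live on
    # the GPU at the same time. `samples` is a (n_samples, seq_len) tensor
    # of token ids and `model` is a Hugging Face causal LM.
    device = next(model.parameters()).device
    outputs = []
    for i in range(0, samples.shape[0], chunk_size):
        chunk = samples[i : i + chunk_size].to(device)
        with torch.no_grad():
            outputs.append(model(chunk).logits.cpu())
    return torch.cat(outputs, dim=0)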

wx971025 commented 2 months ago

This still doesn't seem to work. My complete code is:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4,5,6,7"

model_path = '/data1/models/llms/llama3_8b_it'
quant_path = '/data1/models/llms/llama3_8b_it_awq_int4'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

The error message is:

Traceback (most recent call last):
  File "/data1/models/llms/awq_quant.py", line 18, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/awq/models/base.py", line 162, in quantize
    self.quantizer = AwqQuantizer(
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 59, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant()
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 478, in init_quant
    self.model(samples.to(next(self.model.parameters()).device))
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1176, in forward
    outputs = self.model(
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 993, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
  File "/home/wangxu/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1076, in _update_causal_mask
    causal_mask = causal_mask.to(dtype=dtype, device=device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 GiB. GPU 0 has a total capacity of 79.33 GiB of which 35.74 GiB is free. Process 32036 has 29.30 GiB memory in use. Process 107088 has 944.00 MiB memory in use. Process 107074 has 944.00 MiB memory in use. Process 107092 has 972.00 MiB memory in use. Process 107084 has 946.00 MiB memory in use. Process 107099 has 950.00 MiB memory in use. Process 114985 has 1.49 GiB memory in use. Process 114975 has 1.49 GiB memory in use. Process 115051 has 1.49 GiB memory in use. Process 115031 has 1.49 GiB memory in use. Process 115003 has 1.49 GiB memory in use. Including non-PyTorch memory, this process has 2.12 GiB memory in use. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 9.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I don't quite understand why the 8B model needs to allocate 118 GB of GPU memory.

wx971025 commented 2 months ago

> One last thing I noticed about your code that can cause OOM: you use device_map='auto', which makes accelerate fill all GPUs with the model. It's better to set this to None and keep as much of the model as possible in CPU RAM; AutoAWQ will then move the model to the GPU layer by layer and quantize it.

The above errors occurred with transformers version 4.38.2; after I upgraded to 4.40.2, I was able to get it working.