Open wx971025 opened 2 months ago
I quantized llama 3 70B on 3x A6000 48GB. Did you adjust the calibration dataset?
No, I did not change the calibration dataset
Ahh I see the issue. This is a transformers issue where they have a memory leak in their cache.
If you look at examples/quantize.py, we pass the use_cache: False argument, which fixes this.
This doesn't seem to work; it still causes OOM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 GiB. GPU 0 has a total capacity of 79.33 GiB of which 3.88 GiB is free. Process 32036 has 29.30 GiB memory in use. Process 107088 has 944.00 MiB memory in use. Process 107074 has 944.00 MiB memory in use. Process 107092 has 972.00 MiB memory in use. Process 107084 has 946.00 MiB memory in use. Process 107099 has 950.00 MiB memory in use. Process 114985 has 1.49 GiB memory in use. Process 114975 has 1.49 GiB memory in use. Process 115051 has 1.49 GiB memory in use. Process 115031 has 1.49 GiB memory in use. Process 115003 has 1.49 GiB memory in use. Process 107101 has 31.84 GiB memory in use. Including non-PyTorch memory, this process has 2.12 GiB memory in use. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 9.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
One last thing I noticed about your code that can cause OOM: you use device_map='auto',
which makes accelerate fill every GPU with the model. It's better to set this to None
and keep as much of the model as possible in CPU RAM - AutoAWQ will then move the model to GPU layer by layer and quantize it.
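A minimal sketch of that loading pattern (the actual from_pretrained call is commented out since it needs the model weights on disk; model_path is a placeholder):

```python
# Hedged sketch: keep the full-precision weights in CPU RAM instead of
# letting accelerate spread them across every GPU. AutoAWQ then streams
# one decoder layer at a time to the GPU while quantizing.
load_kwargs = {
    "low_cpu_mem_usage": True,
    "use_cache": False,
    "device_map": None,  # None = keep the model on CPU, not "auto"
}

# from awq import AutoAWQForCausalLM
# model = AutoAWQForCausalLM.from_pretrained(model_path, **load_kwargs)
print(load_kwargs["device_map"])
```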
By the way, this is a known issue. AWQ batches 128 samples through the model's forward pass at the same time. A fix is being worked on that splits the samples into smaller chunks for a more VRAM-friendly experience.
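The planned fix can be sketched like this (a toy illustration, not the actual AutoAWQ implementation: run_in_chunks and the identity "forward pass" are hypothetical names for demonstration):

```python
# Hypothetical sketch: instead of pushing all 128 calibration samples
# through a layer at once, split them into smaller chunks. Peak
# activation memory then scales with chunk_size instead of n_samples.
def run_in_chunks(samples, chunk_size, forward):
    outputs = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]  # at most chunk_size items
        outputs.extend(forward(chunk))
    return outputs

# Toy forward pass (identity) over 128 "samples" in chunks of 16:
result = run_in_chunks(list(range(128)), 16, lambda chunk: chunk)
assert result == list(range(128))  # same outputs, smaller peak batch
```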
This still doesn't seem to work. My complete code is:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4,5,6,7"
model_path = '/data1/models/llms/llama3_8b_it'
quant_path = '/data1/models/llms/llama3_8b_it_awq_int4'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(
model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
The error message is:
Traceback (most recent call last):
File "/data1/models/llms/awq_quant.py", line 18, in
The above errors were reported in Transformers version 4.38.2, and I was able to get it working when I upgraded to 4.40.2
I used the example script in the readme to quantize llama3-8b
I used 6x A800, but it still reports:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 GiB. GPU 0 has a total capacity of 79.33 GiB of which 76.77 GiB is free.
How should I set the quantization parameters to reduce the GPU memory consumed during quantization?