Hi @activezhao, this looks like a transformers issue. They have been having issues with their cache ever since 4.36.0. The current workaround is to pass **{"use_cache": False} until they fix it.
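For reference, a minimal sketch of where that flag goes (the model path below is a placeholder, and this assumes the extra kwargs are forwarded to transformers' from_pretrained):
from awq import AutoAWQForCausalLM
# Placeholder path; substitute your own model directory.
model_path = "/path/to/model"
# Disable the transformers KV cache while quantizing (the workaround above).
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    safetensors=True,
    **{"low_cpu_mem_usage": True, "use_cache": False},
)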
@casper-hansen OK, thanks for your reply, I will try it.
@casper-hansen One more thing: will AutoAWQ support multiple GPUs during quantization?
Thanks.
Hi @casper-hansen I tried to add the **{"use_cache": False} parameter, but the same error occurred again.
Is there any other way?
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = '/data/deepseek-6.7B-tencent'
quant_path = '/data/deepseek-6.7B-tencent-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True, **{"low_cpu_mem_usage": True, "use_cache": False})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 33.00 GiB. GPU 0 has a total capacty of 22.20 GiB of which 21.06 GiB is free. Process 1957804 has 1.14 GiB memory in use. Of the allocated memory 900.52 MiB is allocated by PyTorch, and 3.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
You should not be running into OOM issues with this configuration. You only need about 16 GB of VRAM to fit a 7B model into memory. In my own testing, I have not run into these issues, so I am unsure what is causing the problem for you.
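(As a rough back-of-the-envelope check, not from this thread: 6.7B parameters × 2 bytes per fp16 weight ≈ 13.4 GB, plus headroom for activations and calibration data, which is roughly consistent with the ~16 GB figure above.)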
Check out this issue, which was closed as not related but is probably the reason for your problems: https://github.com/casper-hansen/AutoAWQ/issues/372
@DreamGenX OK, thanks.
By the way, what CUDA version are you using? Mine is 12.3; is that too high?
I encountered the same problem when quantizing deepseek-coder-1.3b, which ended up consuming 37 GB of VRAM. Both AutoAWQ and transformers are installed from source.
I heard somewhere that this could also be due to a large rope_theta. All of these models have rope_theta >> 10000 (which is the common value for the old Llama 2 models and the base Mistral 7B).
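If you want to check whether a model falls into that bucket, something like this works for Llama-style configs (the path is a placeholder):
from transformers import AutoConfig
# Placeholder path; rope_theta and rope_scaling are standard fields on Llama-style configs.
config = AutoConfig.from_pretrained("/path/to/model", trust_remote_code=True)
print(config.rope_theta, getattr(config, "rope_scaling", None))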
@TechxGenus Have you solved it?
I have solved the error: I changed transformers from 4.38.1 to 4.37.2, and it works.
Thanks to all of you.
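For anyone hitting the same thing, pinning the working version is a one-liner (the version number is from the comment above):
pip install "transformers==4.37.2"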
Hi @casper-hansen I deployed the AWQ model of deepseek-6.7b with vLLM, but the HumanEval score is only 18, while the normal score is more than 40. Is there something wrong with how I handled it?
Thanks.
The Python file is:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = '/data/deepseek-6.7B'
quant_path = '/data/deepseek-6.7B-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True, **{"low_cpu_mem_usage": True, "use_cache": False}, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.73s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 167/167 [00:00<00:00, 1.77MB/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.
warnings.warn("Repo card metadata block was not found. Setting CardData to empty.")
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 471M/471M [00:53<00:00, 8.85MB/s]
Generating validation split: 214670 examples [00:03, 69675.70 examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (98937 > 16384). Running this sequence through the model will result in indexing errors
AWQ: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [17:59<00:00, 33.74s/it]
Then I get the following files:
root@2378096be2f6:/data/deepseek-6.7B-awq# ll
total 3800100
drwxr-xr-x 2 root root 4096 Mar 5 14:33 ./
drwxr-xr-x 19 root root 4096 Mar 5 14:33 ../
-rw-r--r-- 1 root root 916 Mar 5 14:33 config.json
-rw-r--r-- 1 root root 140 Mar 5 14:33 generation_config.json
-rw-r--r-- 1 root root 3889899416 Mar 5 14:33 model.safetensors
-rw-r--r-- 1 root root 735 Mar 5 14:33 special_tokens_map.json
-rw-r--r-- 1 root root 1370514 Mar 5 14:33 tokenizer.json
-rw-r--r-- 1 root root 6101 Mar 5 14:33 tokenizer_config.json
And the config.json is:
(base) [root@VM]# cat deepseek-6.7B-awq/config.json
{
  "_name_or_path": "/data/deepseek-6.7B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 32013,
  "eos_token_id": 32022,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "use_cache": false,
  "vocab_size": 32031
}
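For reference, loading an AWQ checkpoint like this in vLLM generally looks like the sketch below (illustrative only; the exact serving command was not posted in this thread, and the prompt and sampling settings are placeholders):
from vllm import LLM, SamplingParams
# Point vLLM at the quantized folder and enable its AWQ kernels.
llm = LLM(model="/data/deepseek-6.7B-awq", quantization="awq", dtype="half")
params = SamplingParams(temperature=0.0, max_tokens=256)
# Placeholder prompt just to exercise the model.
print(llm.generate(["def fibonacci(n):"], params)[0].outputs[0].text)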
This may be related to the fact that AWQ is meant for fast local inference, not production. By quantizing 16-bit floating-point values into 4-bit integers, the quality will suffer accordingly. If you compare with GGUF and EXL2, there is a similar quality loss at 4-bit. This lets you perform functional testing on the model before deploying the native version to production.
From my experience with the way AWQ quantizes, a 4-bit AWQ will usually score better than a 4-bit GGUF or EXL2, and inference is usually faster. When I want high quality (ready for production), I need to use the native fp16 or bfloat16.
Some models suffer higher quantization error than others. Unfortunately, I do not always have the time or hardware to keep supporting every model. There is now a new option on the main branch where you can use apply_clip=False, which some users reported helped them on Chinese models like Qwen/DeepSeek.
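If it helps, here is how that option slots into the quantization script above (a sketch, assuming a current main-branch install where quantize() accepts apply_clip; paths are placeholders):
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/model"       # placeholder
quant_path = "/path/to/model-awq"   # placeholder
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True, **{"low_cpu_mem_usage": True, "use_cache": False})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# apply_clip=False skips the weight-clipping step; some users reported this helps Qwen/DeepSeek models.
model.quantize(tokenizer, quant_config=quant_config, apply_clip=False)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)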
I'm closing this issue for now, as OOM is not a general issue for AutoAWQ when you follow the examples. I believe transformers has also introduced changes in 4.39 that stop these caching issues.
I use an A10 GPU with AutoAWQ 0.2.2, and the model is deepseek-7B. The command is:
And the error is:
Looking at the code, it seems AutoAWQ does not support multiple GPUs? https://github.com/casper-hansen/AutoAWQ/blob/68c727a1a338a1e8d988e8f6094e0d38040e0bb6/awq/utils/utils.py#L89
Is there any other way to solve this OOM problem?
Thanks.