casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

OOM in A10 GPU with AutoAWQ 0.2.2 #382

Closed: activezhao closed this 5 months ago

activezhao commented 7 months ago

I am using an A10 GPU with AutoAWQ 0.2.2, and the model is deepseek-7B.

The script is:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/data/deepseek-6.7B-tencent'
quant_path = '/data/deepseek-6.7B-tencent-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

And the error is:

Loading checkpoint shards:   0%|                                                                                                          | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.61it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.
  warnings.warn("Repo card metadata block was not found. Setting CardData to empty.")
Token indices sequence length is longer than the specified maximum sequence length for this model (98937 > 16384). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "/data/vllm_awq.py", line 13, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/models/base.py", line 155, in quantize
    self.quantizer = AwqQuantizer(
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 56, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant()
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 466, in init_quant
    self.model(samples.to(next(self.model.parameters()).device))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 982, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1072, in _update_causal_mask
    causal_mask = causal_mask.to(dtype=dtype, device=device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 33.00 GiB. GPU 0 has a total capacty of 22.20 GiB of which 21.06 GiB is free. Process 1957804 has 1.14 GiB memory in use. Of the allocated memory 900.52 MiB is allocated by PyTorch, and 3.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking at the code, it seems AutoAWQ does not support multiple GPUs for quantization? https://github.com/casper-hansen/AutoAWQ/blob/68c727a1a338a1e8d988e8f6094e0d38040e0bb6/awq/utils/utils.py#L89

Is there any other way to solve this OOM problem?

Thanks.

casper-hansen commented 7 months ago

Hi @activezhao, this looks like a transformers issue. They have been having issues with their cache ever since 4.36.0. The current workaround is to pass **{"use_cache": False} until they fix it.
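
A minimal sketch of the workaround (assuming the extra kwargs are simply forwarded to transformers' from_pretrained, as in the example scripts):

from awq import AutoAWQForCausalLM

# Same model path as in the original script.
model_path = '/data/deepseek-6.7B-tencent'

# Forward use_cache=False alongside the usual loading options.
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)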

activezhao commented 7 months ago

> Hi @activezhao, this looks like a transformers issue. They have been having issues with their cache ever since 4.36.0. The current workaround is to pass **{"use_cache": False} until they fix it.

@casper-hansen OK, thanks for your reply, I will try it.

activezhao commented 7 months ago

> Hi @activezhao, this looks like a transformers issue. They have been having issues with their cache ever since 4.36.0. The current workaround is to pass **{"use_cache": False} until they fix it.

@casper-hansen One more thing: will AutoAWQ support multiple GPUs for quantization?

Thanks.

activezhao commented 7 months ago

> Hi @activezhao, this looks like a transformers issue. They have been having issues with their cache ever since 4.36.0. The current workaround is to pass **{"use_cache": False} until they fix it.

Hi @casper-hansen, I tried adding the **{"use_cache": False} parameter, but the same error occurred again.

Is there any other way?

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/data/deepseek-6.7B-tencent'
quant_path = '/data/deepseek-6.7B-tencent-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True, **{"low_cpu_mem_usage": True, "use_cache": False})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

The error is the same:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 33.00 GiB. GPU 0 has a total capacty of 22.20 GiB of which 21.06 GiB is free. Process 1957804 has 1.14 GiB memory in use. Of the allocated memory 900.52 MiB is allocated by PyTorch, and 3.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

casper-hansen commented 7 months ago

You should not be running into OOM issues with this configuration. You only need about 16 GB of VRAM to fit a 7B model into memory. In my own testing, I have not run into these issues, so I am unsure what is causing it for you.

DreamGenX commented 6 months ago

Check out this issue, which was closed as unrelated but is probably the reason for your problems: https://github.com/casper-hansen/AutoAWQ/issues/372

activezhao commented 6 months ago

> Check out this issue, which was closed as unrelated but is probably the reason for your problems: https://github.com/casper-hansen/AutoAWQ/issues/372

@DreamGenX OK, thanks

By the way, what CUDA version are you using? Mine is 12.3. Is that too high?

TechxGenus commented 6 months ago

I encountered the same problem when quantizing deepseek-coder-1.3b, which ended up consuming 37 GB of VRAM. Both AutoAWQ and transformers are installed from source.

DreamGenX commented 6 months ago

I heard somewhere that this could also be due to a large rope_theta. All of these models have rope_theta >> 10000 (which is the common value for older Llama 2 models and the base Mistral 7B).
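
To check what a given model uses, something like this works (a minimal sketch with transformers' AutoConfig; the path is illustrative):

from transformers import AutoConfig

# Illustrative local path; point this at your own model directory.
config = AutoConfig.from_pretrained("/data/deepseek-6.7B-tencent")
print(getattr(config, "rope_theta", None))               # 100000 for this deepseek model
print(getattr(config, "max_position_embeddings", None))  # 16384 here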

activezhao commented 6 months ago

> I encountered the same problem when quantizing deepseek-coder-1.3b, which ended up consuming 37 GB of VRAM. Both AutoAWQ and transformers are installed from source.

@TechxGenus Have you solved it?

activezhao commented 6 months ago

I have solved the error: I changed transformers from 4.38.1 to 4.37.2 and it works.

Thanks to all of you.

activezhao commented 6 months ago

Hi @casper-hansen, I deployed the AWQ model of deepseek-6.7B with vLLM, but the HumanEval score is only 18, while the normal score is more than 40. Is there something wrong with how I handled it?

Thanks.

The Python script is:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/data/deepseek-6.7B'
quant_path = '/data/deepseek-6.7B-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True, **{"low_cpu_mem_usage": True, "use_cache": False}, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

The output is:

Loading checkpoint shards:   0%|                                                                                                          | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.73s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 167/167 [00:00<00:00, 1.77MB/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.
  warnings.warn("Repo card metadata block was not found. Setting CardData to empty.")
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 471M/471M [00:53<00:00, 8.85MB/s]
Generating validation split: 214670 examples [00:03, 69675.70 examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (98937 > 16384). Running this sequence through the model will result in indexing errors
AWQ: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [17:59<00:00, 33.74s/it]

Then I get the following files:

root@2378096be2f6:/data/deepseek-6.7B-awq# ll
total 3800100
drwxr-xr-x  2 root root       4096 Mar  5 14:33 ./
drwxr-xr-x 19 root root       4096 Mar  5 14:33 ../
-rw-r--r--  1 root root        916 Mar  5 14:33 config.json
-rw-r--r--  1 root root        140 Mar  5 14:33 generation_config.json
-rw-r--r--  1 root root 3889899416 Mar  5 14:33 model.safetensors
-rw-r--r--  1 root root        735 Mar  5 14:33 special_tokens_map.json
-rw-r--r--  1 root root    1370514 Mar  5 14:33 tokenizer.json
-rw-r--r--  1 root root       6101 Mar  5 14:33 tokenizer_config.json

And the config.json is:

(base) [root@VM]# cat deepseek-6.7B-awq/config.json 
{
  "_name_or_path": "/data/deepseek-6.7B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 32013,
  "eos_token_id": 32022,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "use_cache": false,
  "vocab_size": 32031
}

suparious commented 6 months ago

> the HumanEval score is only 18, while the normal score is more than 40

This may be related to the fact that AWQ is meant for fast local inference, not production. By quantizing 16-bit floating-point values into 4-bit integers, the quality will suffer accordingly. If you compare with GGUF and EXL2, there seems to be a similar quality loss at 4 bits. This lets you perform functional testing on the model before deploying the native version to production.

In my experience, given the way AWQ quantizes, a 4-bit AWQ model will usually score better than a 4-bit GGUF or EXL2, and inference is usually faster. When I want high quality (ready for production), I need to use the native fp16 or bfloat16.

casper-hansen commented 5 months ago

Some models suffer from higher quantization error than others. Unfortunately, I do not always have the time or hardware to keep supporting every model. There is now a new option on the main branch, apply_clip=False, which some users reported helped them with Chinese models like Qwen/DeepSeek.
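
A minimal sketch of how that option would be used (assuming apply_clip is accepted as a keyword argument by model.quantize() on the main branch):

# apply_clip=False skips the clipping step during quantization,
# which some users reported helps certain models.
model.quantize(tokenizer, quant_config=quant_config, apply_clip=False)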

casper-hansen commented 5 months ago

I'm closing this issue for now, as OOM is not a general issue for AutoAWQ when you follow the examples. I believe transformers has also introduced changes in 4.39 that stop these caching issues.