casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Getting OOM error while loading llama 70b using AWQ. #162

Closed · ab6995 closed this 11 months ago

ab6995 commented 11 months ago

Below is the error I am getting while loading the TheBloke/llama-2-70b-chat-AWQ model:

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 22.20 GiB total capacity; 21.30 GiB already allocated; 99.12 MiB free; 21.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = model_path

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
```

Instance: 4x NVIDIA A10G (24 GB each)

Library: autoawq 0.1.6+cu118

Image: PyTorch 2.0.1, Python 3.10

CUDA nvcc:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```

I am also experiencing the same issue while working with vLLM. Is there a config I have to set to distribute the model over 4 GPUs?
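
(As an aside, a rough sketch of how I would expect the vLLM side to be configured, assuming `tensor_parallel_size` and `quantization="awq"` behave as documented for the installed vLLM version; treat the exact arguments as something to verify rather than a confirmed recipe.)

```python
# Hedged sketch, not verified on this setup: splitting an AWQ model over 4 GPUs in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/llama-2-70b-chat-AWQ",  # model ID from this report; adjust as needed
    quantization="awq",                     # use the AWQ weights/kernels
    tensor_parallel_size=4,                 # shard across the 4 A10Gs
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```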

casper-hansen commented 11 months ago

Fused modules are enabled by default and preallocate a cache, which takes up memory. Can you try with `AutoAWQForCausalLM.from_quantized(max_new_tokens=512)`?

See more here: https://github.com/casper-hansen/AutoAWQ#fused-modules
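
For completeness, a minimal sketch of that suggestion spelled out, using the model path from the report above; the exact keyword support depends on the installed AutoAWQ version:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/llama-2-70b-chat-AWQ"

# Limit the preallocated fused cache; a smaller max_new_tokens reserves less VRAM up front.
model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=True,
    max_new_tokens=512,
)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
```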

ab6995 commented 11 months ago

Hi Casper,

Still gives me the same error. Here is the whole traceback:

```
/opt/conda/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Replacing layers...: 100%|██████████| 80/80 [00:08<00:00, 9.60it/s]

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[3], line 7
      4 quant_path = "./awq_70b/"
      5 # AutoAWQForCausalLM.from_quantized(max_new_tokens=512)
      6 # Load model
----> 7 model = AutoAWQForCausalLM.from_quantized(quant_path,max_new_tokens=512)
      8 tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
      9 streamer = TextStreamer(tokenizer, skip_special_tokens=True)

File /opt/conda/lib/python3.10/site-packages/awq/models/auto.py:51, in AutoAWQForCausalLM.from_quantized(self, quant_path, quant_filename, max_new_tokens, trust_remote_code, fuse_layers, batch_size, safetensors, max_memory, offload_folder)
     48 os.environ["AWQ_BATCH_SIZE"] = str(batch_size)
     49 model_type = check_and_get_model_type(quant_path, trust_remote_code)
---> 51 return AWQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(
     52     quant_path, model_type, quant_filename, max_new_tokens, trust_remote_code=trust_remote_code,
     53     fuse_layers=fuse_layers, safetensors=safetensors,
     54     max_memory=max_memory, offload_folder=offload_folder
     55 )

File /opt/conda/lib/python3.10/site-packages/awq/models/base.py:162, in BaseAWQForCausalLM.from_quantized(self, model_path, model_type, model_filename, max_new_tokens, torch_dtype, trust_remote_code, safetensors, is_quantized, fuse_layers, version, max_memory, offload_folder)
    154 device_map = infer_auto_device_map(
    155     model,
    156     no_split_module_classes=[self.layer_type],
    157     max_memory=max_memory,
    158     dtype=torch_dtype
    159 )
    161 # Load checkpoint
--> 162 load_checkpoint_in_model(
    163     model,
    164     checkpoint=model_weights_path,
    165     device_map=device_map,
    166     offload_folder=offload_folder,
    167     dtype=torch_dtype
    168 )
    170 # Dispath to devices
    171 if fuse_layers:

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/modeling.py:1335, in load_checkpoint_in_model(failed resolving arguments)
   1333 buffer_names = [name for name, _ in model.named_buffers()]
   1334 for checkpoint_file in checkpoint_files:
-> 1335     checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
   1336     if device_map is None:
   1337         model.load_state_dict(checkpoint, strict=False)

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/modeling.py:1164, in load_state_dict(checkpoint_file, device_map)
   1161 else:
   1162     # if we only have one device we can load everything directly
   1163     if len(set(device_map.values())) == 1:
-> 1164         return safe_load_file(checkpoint_file, device=list(device_map.values())[0])
   1166 devices = list(set(device_map.values()) - {"disk"})
   1167 # cpu device should always exist as fallback option

File /opt/conda/lib/python3.10/site-packages/safetensors/torch.py:310, in load_file(filename, device)
    308 with safe_open(filename, framework="pt", device=device) as f:
    309     for k in f.keys():
--> 310         result[k] = f.get_tensor(k)
    311 return result

OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 22.20 GiB total capacity; 21.30 GiB already allocated; 99.12 MiB free; 21.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

casper-hansen commented 11 months ago

I will have to investigate further why it is throwing OOM on multi-GPU. It seems the way we use accelerate may have become broken since it was implemented in AutoAWQ. The 70B model should only require 40GB VRAM.

ab6995 commented 11 months ago

> I will have to investigate further why it is throwing OOM on multi-GPU. It seems the way we use accelerate may have become broken since it was implemented in AutoAWQ. The 70B model should only require 40GB VRAM.

Yeah, that was my hunch as well, since loading 70B on the same machine works fine with GPTQ and CTransformers; it should not throw OOM errors. Below is the full pip list of the env I have, which might help to replicate:

Package Version


absl-py 2.0.0 accelerate 0.24.1 aiohttp 3.8.6 aiosignal 1.3.1 anyio 3.7.1 apex 0.1 asttokens 2.2.1 async-timeout 4.0.3 attributedict 0.3.0 attrs 22.2.0 autoawq 0.1.6+cu118 certifi 2023.5.7 cffi 1.15.1 chardet 5.2.0 charset-normalizer 3.1.0 click 8.1.3 cloudpickle 2.2.1 cmake 3.26.3 codecov 2.1.13 colorama 0.4.6 coloredlogs 15.0.1 colour-runner 0.1.1 comm 0.1.3 commonmark 0.9.1 contextlib2 21.6.0 contourpy 1.0.7 coverage 7.3.2 cryptography 40.0.1 cycler 0.11.0 cymem 2.0.7 Cython 0.29.34 DataProperty 1.0.1 datasets 2.14.6 debugpy 1.6.7 decorator 5.1.1 deepdiff 6.6.1 deepspeed 0.6.1+1ea3d4b dgl 1.1.0+cu118 dill 0.3.6 distlib 0.3.7 docutils 0.15.2 einops 0.6.1 exceptiongroup 1.1.3 executing 1.2.0 fastai 2.7.12 fastapi 0.104.1 fastcore 1.5.29 fastdownload 0.0.7 fastprogress 1.0.3 filelock 3.13.1 flash-attn 0.2.8 fonttools 4.39.4 frozenlist 1.4.0 fsspec 2023.5.0 future 0.18.3 gevent 22.10.2 gmpy2 2.1.2 google-pasta 0.2.0 greenlet 2.0.2 h11 0.14.0 h5py 3.8.0 hjson 3.1.0 horovod 0.26.1 httptools 0.6.1 huggingface-hub 0.17.3 humanfriendly 10.0 idna 3.4 imageio 2.28.1 importlib-metadata 4.13.0 inotify-simple 1.2.1 inspecta 0.1.3 ipykernel 6.23.0 ipython 8.13.2 jedi 0.18.2 Jinja2 3.1.2 jmespath 1.0.1 joblib 1.2.0 jsonlines 4.0.0 jsonpatch 1.32 jsonpointer 2.3 jsonschema 4.17.3 jupyter_client 8.2.0 jupyter_core 5.3.0 kiwisolver 1.4.4 langcodes 3.3.0 libmambapy 1.4.1 lit 16.0.3 llvmlite 0.39.1 lm-eval 0.3.0 mamba 1.4.1 MarkupSafe 2.1.2 matplotlib 3.7.1 matplotlib-inline 0.1.6 mbstrdecoder 1.1.3 mpi4py 3.1.4 mpmath 1.3.0 msgpack 1.0.7 multidict 6.0.4 multiprocess 0.70.14 munkres 1.1.4 murmurhash 1.0.9 nest-asyncio 1.5.6 networkx 3.1 ninja 1.11.1 nltk 3.8.1 numba 0.56.4 numexpr 2.8.7 numpy 1.23.5 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 openai 0.28.1 opencv-python 4.7.0 ordered-set 4.1.0 packaging 23.1 pandas 2.0.1 paramiko 3.1.0 parso 0.8.3 pathos 0.3.0 pathvalidate 3.2.0 pathy 0.10.1 patsy 0.5.3 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.4.0 pip 23.1.2 platformdirs 3.11.0 plotly 5.14.1 pluggy 1.3.0 ply 3.11 pooch 1.7.0 portalocker 2.8.2 pox 0.3.2 ppft 1.7.6.6 preshed 3.0.8 prompt-toolkit 3.0.38 protobuf 3.20.3 protobuf3-to-dict 0.1.5 psutil 5.9.5 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 12.0.0 pyasn1 0.4.8 pybind11 2.10.4 pybind11-global 2.10.4 pycosat 0.6.4 pycountry 22.3.5 pycparser 2.21 pydantic 1.10.7 pyfunctional 1.4.3 Pygments 2.15.1 pyinstrument 3.4.2 pyinstrument-cext 0.2.4 PyNaCl 1.5.0 pyOpenSSL 23.1.1 pyparsing 3.0.9 pyproject-api 1.6.1 PyQt5 5.15.7 PyQt5-sip 12.11.0 pyrsistent 0.19.3 PySocks 1.7.1 pytablewriter 1.2.0 python-dateutil 2.8.2 python-dotenv 1.0.0 pytz 2023.3 PyYAML 5.4.1 pyzmq 25.0.2 ray 2.8.0 regex 2023.10.3 requests 2.28.2 retrying 1.3.4 rich 12.6.0 rootpath 0.1.1 rouge-score 0.1.2 ruamel.yaml 0.17.21 ruamel.yaml.clib 0.2.7 sacrebleu 1.5.0 safetensors 0.4.0 scipy 1.10.1 seaborn 0.12.2 sentencepiece 0.1.99 setuptools 65.6.3 shap 0.41.0 shellingham 1.5.1 smdistributed-dataparallel 1.8.0 smdistributed-modelparallel 1.15.0 tokenizers 0.14.1 torch 2.0.1 torchaudio 2.0.1 torchdata 0.6.0 torchnet 0.0.4 torchtext 0.15.1 torchvision 0.15.1 tornado 6.3 tox 4.11.3 tqdm 4.65.0 tqdm-multiprocess 0.0.11 traitlets 5.9.0 transformers 4.35.0 triton 2.0.0 typepy 1.3.2 typer 0.7.0 typing_extensions 4.8.0 
tzdata 2023.3 unicodedata2 15.0.0 urllib3 1.26.15 uvicorn 0.24.0 uvloop 0.19.0 virtualenv 20.24.6 visdom 0.2.4 vllm 0.2.1.post1

rtwang1997 commented 11 months ago

Hello, I am running into similar OOM issues when trying to load this model using AutoAWQForCausalLM: https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-AWQ

The instance I am using has 4 A10G GPUs (96 GB of VRAM total), and the model I am trying to load has 34B parameters, so I believe it should not be running into OOM.

Was wondering if the root cause is the same as this issue. Thanks!

yatesdr commented 11 months ago

Same here, the memory split implementation seems to be broken. Loading Yi-34B-200K on a pair of A6000s leads to an OOM error when the first GPU fills up.

Maybe related to fusion?

casper-hansen commented 11 months ago

> Same here, the memory split implementation seems to be broken. Loading Yi-34B-200K on a pair of A6000s leads to an OOM error when the first GPU fills up.
>
> Maybe related to fusion?

It is not related to fusing layers, because the OOM happens during `load_checkpoint_in_model` and fusing takes place after.

This OOM issue on multi-GPU is now the last issue I want to fix before v0.1.7 release.

yatesdr commented 11 months ago

Good luck. If there's any way I can contribute to troubleshooting or validating a fix, hit me up; I'll be watching for 0.1.7. For now I failed at tracing the bug down and just turned the context down to 32k, and it runs fine: it loads ~300MB on the 2nd GPU but puts what appears to be all the tensors on the first GPU. Blazing fast, high quality. Nice work on this project.

casper-hansen commented 11 months ago

#190 now fixes this. You should be able to get 18-22 tokens/s on a 70B with 2x GPUs, depending of course on how fast the GPUs are.

yatesdr commented 11 months ago

Hi Casper, the new branch did work for me to load the model at larger contexts. Appreciate the update. Will be testing it today and advise if any other issues turn up.

```
git pull origin fix_multi_gpu
pip install -e .
./load_large_ctx.sh  # 150k context
```

```
GPU0 - 35586 / 49140 MiB
GPU1 - 42850 / 49140 MiB
```

Prompt:

```
Human: Please tell me a story about a princess and a pirate. Assistant:
```

Requested_Tokens: 512

Response String: ["\nHuman: Please tell me a story about a princess and a pirate.\nAssistant: Once upon a time, there was a beautiful princess who lived in a castle by the sea. One day, she met a handsome prince named Prince Charming. They fell in love at first sight! Soon after meeting each other for only just barely even hardly worth mentioning here now today tomorrow yesterday forevermore everlasting eternity infinity beyond measure uncountable countless innumerable immeasurable limitless unbounded unlimited undefinable ..........

yatesdr commented 11 months ago

@casper-hansen

Update on this item: while the model does load and split across GPUs, it breaks down under longish prompts. I'm pretty sure this is not expected behavior, as my understanding is that the context is preallocated and should not require further allocation during generate(). Additionally, even with excessive VRAM available, it still over-allocates memory and causes failures, which makes me think this is a bug. Running the same prompt and parameters works fine using other frameworks.

- 1 GPU: 48 GB available, ~40 GB in use after the model loads; failure after the generate() call.
- 2 GPUs: 96 GB available, ~40 GB in use split across both after the model loads; GPU0 fails after exceeding available memory in generate().

Branch: fix_multi_gpu; git pulled / build date: Nov 14, 2:15pm EST.

Steps to duplicate:

1. Load an AWQ model with a context of ~32k or so. This loads into the GPU fine and splits across several GPUs as expected, per the earlier fix in this branch:
   `model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, max_new_tokens=16000)`
2. Do a trivial inference ("Once upon a time, " with max_new_tokens=256 or so), and it works just fine.
3. Load a larger prompt (approximate context length = 6553 tokens), and it begins allocating additional memory until failure. Prompt tokenizing works fine, but generate() quickly allocates too much memory and fails.

Example pseudo-code:

```python
prompt = ...  # any large prompt; I'm using random news articles, typical size ~5-10k tokens
tokens = tokenizer(prompt, add_special_tokens=True, return_tensors='pt').input_ids.cuda()

print(tokens.shape)  # (1, 6553)

outputs = model.generate(tokens, ...)  # other args
```

Error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.90 GiB. GPU 1 has a total capacty of 47.54 GiB of which 8.23 GiB is free. Including non-PyTorch memory, this process has 39.30 GiB memory in use. Of the allocated memory 33.74 GiB is allocated by PyTorch, and 5.25 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

I don't think this memory allocation should occur if I understand your fusion correctly. And even if it should load additional VRAM, it should not exceed 96GB of VRAM with this request, so I think it's broken somehow.

Truncated trace shown here if it's helpful:

```
Traceback (most recent call last):
  [uvicorn, fastAPI stuff that doesn't look relevant removed]
  File "/home/xxx/AI/api_server/api_server.py", line 236, in completion
    response = generate(
  File "/home/xxx/AI/api_server/api_server.py", line 133, in generate
    outputs = model.generate(tokens,
  File "/home/xxx/AI/api_server/AutoAWQ/awq/models/base.py", line 41, in generate
    return self.model.generate(*args, **kwargs)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/xxx/.cache/huggingface/modules/transformers_modules/Yi-34B-200K-AWQ/modeling_yi.py", line 811, in forward
    outputs = self.model(
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxx/AI/api_server/AutoAWQ/awq/modules/fused/model.py", line 46, in forward
    h, _, past_key_value = layer(
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xxx/AI/api_server/AutoAWQ/awq/modules/fused/block.py", line 28, in forward
    attn_output, _, past_key_value = self.attn.forward(
  File "/home/xxx/AI/api_server/AutoAWQ/awq/modules/fused/attn.py", line 184, in forward
    scores = F.softmax(scores.float(), dim=-1).type_as(xq)
  File "/localstorage/xxx/api_server/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 1856, in softmax
    ret = input.softmax(dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.90 GiB. GPU 1 has a total capacty of 47.54 GiB of which 8.23 GiB is free. Including non-PyTorch memory, this process has 39.30 GiB memory in use. Of the allocated memory 33.74 GiB is allocated by PyTorch, and 5.25 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

casper-hansen commented 11 months ago

This seems like normal behavior. You push your GPU to the limit of its memory and then run generation, which requires additional memory for various operations. You are preallocating 32k context and inputting 6.5k context. That additional 6.5k of context still has to be processed through the layers of the model, which uses memory; you cannot consider it cached before it has been processed by the model. The recommendation is to use less context for now.

If you want to allocate memory more strategically, you can do that through the `device_map` argument. I'm not sure if this will work, but maybe `balanced_low_0` could help (I have not checked).
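
A minimal sketch of steering the split manually: `max_memory` appears in the `from_quantized` signature in the traceback above, while support for specific `device_map` strings such as `balanced_low_0` depends on the AutoAWQ/accelerate versions in use, so treat this as an assumption rather than a confirmed API.

```python
from awq import AutoAWQForCausalLM

quant_path = "<path-to-awq-model>"  # placeholder

# Cap GPU 0 lower than GPU 1 so generation-time buffers (attention scores, logits)
# have headroom on the device that tends to fill up first.
model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=True,
    max_new_tokens=4096,
    max_memory={0: "30GiB", 1: "46GiB"},
)
```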

yatesdr commented 11 months ago

I appreciate that, and it must be that my understanding is flawed. After looking into what you said, it seems the differences between expected and actual memory usage mostly trace back to my poor understanding / bad assumptions around the config.json max_position_embeddings setting and the max_new_tokens setting at model load.
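
To illustrate the distinction for anyone else hitting this (a hypothetical snippet, the path is a placeholder): `max_position_embeddings` in config.json is the model's advertised context window, while `max_new_tokens` at load time controls how much cache AutoAWQ preallocates, so loading with a smaller value keeps the upfront allocation down.

```python
from transformers import AutoConfig
from awq import AutoAWQForCausalLM

quant_path = "<path-to-awq-model>"  # placeholder

# The advertised context window, e.g. 200000 for a 200K-context model.
config = AutoConfig.from_pretrained(quant_path, trust_remote_code=True)
print(config.max_position_embeddings)

# Preallocate only the context actually needed, not the full advertised window.
model = AutoAWQForCausalLM.from_quantized(
    quant_path, fuse_layers=True, max_new_tokens=8192
)
```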

casper-hansen commented 11 months ago

> I appreciate that, and it must be that my understanding is flawed. After looking into what you said, it seems the differences between expected and actual memory usage mostly trace back to my poor understanding / bad assumptions around the config.json max_position_embeddings setting and the max_new_tokens setting at model load.

You might also find that solutions like vLLM are more memory efficient; they could be a better fit!