ModelCloud / GPTQModel

GPTQ-based LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
Apache License 2.0

[BUG] gemma-2-9b-it-gptq-4bit vllm oom #331

Open wciq1208 opened 2 months ago

wciq1208 commented 2 months ago

Describe the bug

gemma-2-9b-it-gptq-4bit CUDA OOM on RTX 3090

GPU Info

Sun Aug  4 02:35:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:00:0D.0 Off |                  N/A |
| 30%   34C    P8              30W / 350W |      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:00:0F.0 Off |                  N/A |
| 30%   34C    P8              22W / 350W |      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Software Info

Name: gptqmodel
Version: 0.9.10.dev0+cu1211
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: 
License: 
Location: /opt/conda/lib/python3.11/site-packages
Requires: accelerate, auto-round, datasets, gekko, huggingface-hub, intel-extension-for-transformers, ninja, numpy, packaging, protobuf, rouge, safetensors, sentencepiece, threadpoolctl, torch, tqdm, transformers, triton
Required-by: 
---
Name: torch
Version: 2.3.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /opt/conda/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, auto-round, auto_gptq, gptqmodel, peft, torchaudio, torchelastic, torchvision, vllm, vllm-flash-attn, xformers
---
Name: transformers
Version: 4.43.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /opt/conda/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: auto-round, auto_gptq, gptqmodel, intel-extension-for-transformers, peft, vllm
---
Name: accelerate
Version: 0.33.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /opt/conda/lib/python3.11/site-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: auto-round, auto_gptq, gptqmodel, peft
---
Name: triton
Version: 2.3.1
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License: 
Location: /opt/conda/lib/python3.11/site-packages
Requires: filelock
Required-by: gptqmodel, torch

If you are reporting an inference bug of a post-quantized model, please post the contents of config.json and quantize_config.json.

The model is from https://huggingface.co/ModelCloud/gemma-2-9b-it-gptq-4bit

To Reproduce

import os
# Gemma-2 needs the FlashInfer backend for models with logits_soft_cap; otherwise the output might be wrong.
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

from transformers import AutoTokenizer
from gptqmodel import BACKEND, GPTQModel

model_name = "/model/gemma-2-9b-it-gptq-4bit"

prompt = [{"role": "user", "content": "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"}]

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = GPTQModel.from_quantized(
    model_name,
    backend=BACKEND.VLLM,
)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = model.generate(prompts=inputs, temperature=0.95, max_length=128)
print(outputs[0].outputs[0].text)

Expected behavior

Successful generation without a CUDA OOM.

Model/Datasets

Make sure your model/dataset is downloadable (on HF for example) so we can reproduce your issue.
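
The checkpoint is public on the Hub, so it can be fetched locally before running the script above. A small convenience sketch using huggingface_hub (the local_dir path simply mirrors the one used in the repro script):

from huggingface_hub import snapshot_download

# Download the quantized checkpoint referenced above so the local path in the
# repro script points at a complete snapshot.
snapshot_download(
    repo_id="ModelCloud/gemma-2-9b-it-gptq-4bit",
    local_dir="/model/gemma-2-9b-it-gptq-4bit",
)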

Screenshots

(screenshot attached in the original issue)

Additional context
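
vLLM preallocates its KV cache up to a fixed fraction of VRAM (gpu_memory_utilization defaults to 0.9), and Gemma-2 ships with an 8192-token context, so a single 24 GiB RTX 3090 can run out of memory during engine start-up. Below is a minimal workaround sketch that loads the same checkpoint directly through vllm.LLM with a smaller memory fraction and a shorter max_model_len; the values are illustrative, and whether GPTQModel.from_quantized forwards these engine options to vLLM has not been verified here.

import os

# Same attention-backend choice as in the repro script.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Workaround sketch, not a confirmed fix: constrain vLLM's memory use directly.
llm = LLM(
    model="/model/gemma-2-9b-it-gptq-4bit",
    gpu_memory_utilization=0.85,  # default is 0.9; leave more headroom on a 24 GiB card
    max_model_len=4096,           # cap the preallocated KV cache (Gemma-2 defaults to 8192)
    enforce_eager=True,           # skip CUDA graph capture to save additional memory
)

outputs = llm.generate(
    ["<chat-templated prompt>"],
    SamplingParams(temperature=0.95, max_tokens=128),
)
print(outputs[0].outputs[0].text)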

Qubitium commented 2 months ago

The OOM happened in vLLM. Can you post your installed vLLM and FlashInfer versions? Make sure they are the latest.

wciq1208 commented 2 months ago

> The OOM happened in vLLM. Can you post your installed vLLM and FlashInfer versions? Make sure they are the latest.

Name: vllm
Version: 0.5.3.post1
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email: 
License: Apache 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: aiohttp, cmake, fastapi, filelock, lm-format-enforcer, ninja, numpy, nvidia-ml-py, openai, outlines, pillow, prometheus-client, prometheus-fastapi-instrumentator, psutil, py-cpuinfo, pydantic, pyzmq, ray, requests, sentencepiece, tiktoken, tokenizers, torch, torchvision, tqdm, transformers, typing-extensions, uvicorn, vllm-flash-attn, xformers
Required-by: 
---
Name: flashinfer
Version: 0.1.1
Summary: FlashInfer: Kernel Library for LLM Serving
Home-page: https://github.com/flashinfer-ai/flashinfer
Author: FlashInfer team
Author-email: 
License: Apache License 2.0
Location: /model/flashinfer/python
Editable project location: /model/flashinfer/python
Requires: 
Required-by: 
---
Name: vllm-flash-attn
Version: 2.5.9.post1
Summary: Forward-only flash-attn
Home-page: https://github.com/vllm-project/flash-attention.git
Author: vLLM Team
Author-email: 
License: 
Location: /opt/conda/lib/python3.11/site-packages
Requires: torch
Required-by: vllm