wciq1208 opened this issue 2 months ago
OOM happened in vLLM. Can you post your installed vLLM and FlashInfer versions? Make sure they are the latest.
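The package details that follow were presumably dumped with pip show; an equivalent quick check in Python (a minimal sketch, not part of the original report) is:

```python
# Print installed versions of the packages relevant to this issue (Python 3.8+).
from importlib.metadata import version, PackageNotFoundError

for pkg in ("vllm", "flashinfer", "vllm-flash-attn"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```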
Name: vllm
Version: 0.5.3.post1
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License: Apache 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: aiohttp, cmake, fastapi, filelock, lm-format-enforcer, ninja, numpy, nvidia-ml-py, openai, outlines, pillow, prometheus-client, prometheus-fastapi-instrumentator, psutil, py-cpuinfo, pydantic, pyzmq, ray, requests, sentencepiece, tiktoken, tokenizers, torch, torchvision, tqdm, transformers, typing-extensions, uvicorn, vllm-flash-attn, xformers
Required-by:
---
Name: flashinfer
Version: 0.1.1
Summary: FlashInfer: Kernel Library for LLM Serving
Home-page: https://github.com/flashinfer-ai/flashinfer
Author: FlashInfer team
Author-email:
License: Apache License 2.0
Location: /model/flashinfer/python
Editable project location: /model/flashinfer/python
Requires:
Required-by:
---
Name: vllm-flash-attn
Version: 2.5.9.post1
Summary: Forward-only flash-attn
Home-page: https://github.com/vllm-project/flash-attention.git
Author: vLLM Team
Author-email:
License:
Location: /opt/conda/lib/python3.11/site-packages
Requires: torch
Required-by: vllm
Describe the bug
gemma-2-9b-it-gptq-4bit CUDA OOM on RTX 3090
GPU Info
Software Info
If you are reporting an inference bug of a post-quantized model, please post the content of config.json and quantize_config.json.
Model from https://huggingface.co/ModelCloud/gemma-2-9b-it-gptq-4bit
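Since the maintainers asked for the contents of those two files, here is a minimal sketch of how they could be pulled straight from the Hub (assuming huggingface_hub is installed and the filenames match the repo):

```python
# Fetch and print the quantization-related configs from the reported HF repo.
from huggingface_hub import hf_hub_download

repo = "ModelCloud/gemma-2-9b-it-gptq-4bit"
for fname in ("config.json", "quantize_config.json"):
    path = hf_hub_download(repo_id=repo, filename=fname)
    print(f"--- {fname} ---")
    with open(path) as f:
        print(f.read())
```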
To Reproduce
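No reproduction script was attached; a minimal sketch of a vLLM load that would exercise the same path on a single 24 GB RTX 3090 follows (the parameter values are assumptions, not the reporter's actual settings):

```python
# Load the GPTQ 4-bit checkpoint with vLLM and run a short generation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ModelCloud/gemma-2-9b-it-gptq-4bit",
    quantization="gptq",            # 4-bit GPTQ weights
    gpu_memory_utilization=0.90,    # fraction of the 24 GB card vLLM may claim
    max_model_len=4096,             # smaller KV cache; lower this further if OOM persists
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Note that on vLLM of this vintage Gemma-2 generally has to run with the FlashInfer attention backend (VLLM_ATTENTION_BACKEND=FLASHINFER), which is why the flashinfer version listed above is relevant to this OOM.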
Expected behavior
Inference succeeds without a CUDA OOM.
Model/Datasets
Make sure your model/dataset is downloadable (on HF for example) so we can reproduce your issue.
Screenshots
Additional context