-
### Proposal to improve performance
Hi~ I find that the inference time of Qwen2-VL-7B AWQ is not much improved compared to Qwen2-VL-7B. Do you have any suggestions for improving performance? Thank y…
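A first step for questions like this is to measure the two deployments under identical requests; AWQ mostly helps memory-bound, small-batch decoding, and its gains can shrink at larger batch sizes where dequantization overhead dominates. Below is a minimal, hedged timing sketch against OpenAI-compatible endpoints; the ports, endpoints, and model names are placeholders, not from the issue.

```python
# Hypothetical latency probe: send the same prompt to an FP16 server and an
# AWQ server and compare wall-clock time. Endpoints/models are placeholders.
import time
import requests

def time_request(base_url: str, model: str, prompt: str) -> float:
    start = time.perf_counter()
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

prompt = "Describe this image in one sentence."
fp16 = time_request("http://localhost:30000", "Qwen/Qwen2-VL-7B-Instruct", prompt)
awq = time_request("http://localhost:30001", "Qwen/Qwen2-VL-7B-Instruct-AWQ", prompt)
print(f"fp16: {fp16:.2f}s  awq: {awq:.2f}s")
```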
-
### Checklist
- [X] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.…
-
-
git clone https://www.modelscope.cn/models/linglingdan/MiniCPM-V_2_6_awq_int4
Running inference with this quantized INT4 model uses roughly 20 GB of GPU memory, which is basically the same as the FP model. Could you advise whether something is wrong with the quantization?
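One common explanation, assuming vLLM is the serving engine here (as in the next report), is that vLLM preallocates a fixed fraction of total VRAM for weights plus KV cache regardless of weight size, so an INT4 model can appear to use as much memory as FP16. A minimal sketch, with the model path taken from the clone above:

```python
# Sketch: vLLM reserves gpu_memory_utilization * total VRAM up front, so a
# smaller quantized checkpoint does not by itself shrink reported usage.
# Lowering gpu_memory_utilization makes the INT4 weight footprint visible.
from vllm import LLM

llm = LLM(
    model="MiniCPM-V_2_6_awq_int4",   # local directory from the git clone
    quantization="awq",
    trust_remote_code=True,
    gpu_memory_utilization=0.5,        # reserve less than the 0.9 default
)
```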
-
### Your current environment
vllm==0.6.3.post1
### Model Input Dumps
```bash
ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_pa…
```
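For context, the failing check is plain divisibility: the AWQ Marlin kernel requires each tensor-parallel shard of the weight's input dimension to be a multiple of 128, and 10944 = 128 × 85 + 64. A quick illustration of the arithmetic:

```python
# The Marlin AWQ kernel needs the per-partition input size to be a clean
# multiple of min_thread_k = 128; here the shard size fails that check.
input_size_per_partition = 10944
min_thread_k = 128
print(divmod(input_size_per_partition, min_thread_k))  # (85, 64): remainder 64

# Since per-partition size = full_input_size / tensor_parallel_size, choosing
# a tensor_parallel_size whose shard is a multiple of 128 avoids the error.
```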
-
[AWQ](https://arxiv.org/pdf/2306.00978) seems popular: about 3,000 appearances among Hugging Face models (https://huggingface.co/models?sort=trending&search=AWQ), similar to GPTQ. Maybe we can add this to torch…
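For readers unfamiliar with the method: per the linked paper, AWQ scales salient input channels up before weight-only quantization and folds the scales back afterwards, which leaves the layer output mathematically unchanged while protecting the channels that matter most. A toy sketch of the idea (not the paper's reference implementation; the quantizer and alpha value here are simplified for illustration):

```python
# Toy illustration of AWQ's activation-aware scaling, not the reference code.
import torch

def quantize_int4(w: torch.Tensor) -> torch.Tensor:
    # Naive symmetric round-to-nearest 4-bit round trip, per output row.
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    return (w / scale).round().clamp(-8, 7) * scale

torch.manual_seed(0)
W = torch.randn(64, 64)    # (out_features, in_features)
x = torch.randn(256, 64)   # calibration activations
x[:, :4] *= 10             # a few outlier channels, the case AWQ targets

# Scale salient input channels by activation magnitude (alpha = 0.5 is one
# point on the grid AWQ searches), then fold the scales back into the weight:
# y = (W * s) @ (x / s) is identical to y = W @ x before quantization.
s = x.abs().mean(dim=0).pow(0.5)
W_awq = quantize_int4(W * s) / s
W_rtn = quantize_int4(W)

ref = x @ W.T
print("RTN MSE:", (x @ W_rtn.T - ref).pow(2).mean().item())
print("AWQ MSE:", (x @ W_awq.T - ref).pow(2).mean().item())
```

On inputs with outlier channels like these, the scaled variant typically shows lower output error than plain round-to-nearest, which is the effect the paper measures at scale.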
-
### Model Series
Qwen2.5
### What are the models used?
Qwen2.5-72B-Instruct-AWQ and Qwen2.5-32B-Instruct-AWQ
### What is the scenario where the problem happened?
inference with transformers
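Since the scenario is transformers inference, here is a minimal loading sketch for reference (standard transformers API for prequantized AWQ checkpoints, which requires autoawq to be installed; the prompt and generation settings are placeholders):

```python
# Minimal transformers inference sketch for a prequantized AWQ checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-32B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ kernels run in fp16
    device_map="auto",
)

messages = [{"role": "user", "content": "Briefly introduce large language models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```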
### Is …
-
Hi, first of all, congrats on the great work!
I wanted to ask why there isn't a more thorough comparison between AWQ and SmoothQuant in the paper. To my understanding, they both work using a simila…
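The similarity the question gestures at is real: both methods rest on the same per-channel scaling identity and differ in what gets quantized afterwards (notation below is mine, not either paper's exact symbols):

```latex
y = Wx = \big(W\,\mathrm{diag}(s)\big)\,\big(\mathrm{diag}(s)^{-1}\,x\big)
```

AWQ chooses $s_j = \mathbb{E}[|x_j|]^{\alpha}$ from activation magnitudes and quantizes only $W\,\mathrm{diag}(s)$ to 4-bit weights, while SmoothQuant chooses $s_j = \max|x_j|^{\alpha} / \max|W_{:,j}|^{1-\alpha}$ and quantizes both factors to 8 bits, smoothing activation outliers into the weights; so the scaling trick is shared, but the quantization targets (W4 weight-only vs. W8A8) differ.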
-
What was the quantisation algorithm used in the unsloth/Llama-3.2-1B-bnb-4bit model (https://huggingface.co/docs/transformers/main/en/quantization/overview)? Is it int4_awq or int4_weightonly?
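One direct way to answer this kind of question is to read the checkpoint's quantization_config from its config.json, which records the method used (bitsandbytes, awq, gptq, ...). A small sketch:

```python
# The quantization method is recorded in the checkpoint's config.json under
# "quantization_config"; reading it answers "which algorithm?" directly.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("unsloth/Llama-3.2-1B-bnb-4bit")
qc = getattr(config, "quantization_config", None)
print(qc)  # a "bnb-4bit" checkpoint will typically report bitsandbytes
           # 4-bit (NF4) settings rather than AWQ or GPTQ
```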
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue y…