-
### Your current environment
vllm 0.5.4
### 🐛 Describe the bug
AutoAWQ's Marlin format must be used with no zero point, but vLLM has:
```python
def query_marlin_supported_quant_types(has_zp: bool,
…
```
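For reference, this is what "no zero point" looks like on the AutoAWQ side. A minimal sketch, assuming the usual AutoAWQ `quant_config` keys; only the config is shown, not the full quantization script:

```python
# AutoAWQ quant config for the Marlin kernel: symmetric quantization only,
# so zero_point must be False (the constraint described in this report).
quant_config = {
    "zero_point": False,   # Marlin does not support zero points
    "q_group_size": 128,
    "w_bit": 4,
    "version": "Marlin",
}
```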
-
I use AWQ to quantize Llama 2 70B-Chat by running:
```
CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" python quantize_llama.py
```
The contents of quantize_llama.py:
```
from awq import AutoAWQForCausalLM
from tr…
```
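Since the script is cut off above, here is a minimal sketch of a typical AutoAWQ quantization script for this model, assuming the standard `AutoAWQForCausalLM` API and the default GEMM config; it is not the reporter's actual quantize_llama.py, and the paths are illustrative:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-70b-chat-hf"   # illustrative source checkpoint
quant_path = "llama-2-70b-chat-awq"             # illustrative output directory

# 4-bit AWQ with the default GEMM kernel and per-group scales
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 model and tokenizer, run calibration-based quantization, and save
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```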
-
Hi TensorRT-LLM team, your work is incredible.
By following the README file for [multimodal models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md), we were able to successfully run…
-
In reviewing the updated `docs` I noticed a few things that prompted some questions...
1) Neither AWQ/Int-4/`int32_float16` is mentioned in the "Quantize on model conversion" nor the "Quantize…
-
### Your current environment
...
### How would you like to use vllm
I have downloaded a model. Now on my 4 GPU instance I attempt to quantize it using AutoAWQ.
Whenever I run the script below I ge…
-
### System Info
```shell
Name: optimum
Version: 1.18.0.dev0
Name: transformers
Version: 4.36.0
Name: auto-gptq
Version: 0.6.0.dev0+cu118
CUDA Version: 11.8
Python 3.8.17
```
### Who can help…
-
I have downloaded a model. Now on my 4 GPU instance I attempt to quantize it using AutoAWQ.
Whenever I run the script below I get 0% GPU utilization.
Can anyone help explain why this might be happening?
…
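Not from the original report, but a small diagnostic can help narrow this down: confirm the process actually sees the four GPUs and check where the loaded model's weights live (the helper name below is illustrative).

```python
import torch

# 1) Verify CUDA is available and all four GPUs are visible to this process
print("cuda available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")

# 2) After loading the model (e.g. with AutoAWQ), count parameters per device;
#    if everything reports 'cpu', the weights never reached the GPUs, which
#    would be consistent with sustained 0% GPU utilization.
def device_histogram(model):
    counts = {}
    for p in model.parameters():
        counts[str(p.device)] = counts.get(str(p.device), 0) + 1
    return counts
```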
-
AutoGPTQForCausalLM.from_quantized fails when loading the official 4-bit quantized model ([Llama2-Chinese-13b-Chat-4bit](https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat-4bit/tree/main)) with: NameError: name 'autogptq_cuda_256' is not de…
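For context, a minimal sketch of the kind of load that hits this path, assuming the standard auto-gptq `from_quantized` API (the exact arguments used in the original report are not shown):

```python
from auto_gptq import AutoGPTQForCausalLM

# Loading a pre-quantized 4-bit GPTQ checkpoint. A NameError on
# 'autogptq_cuda_256' usually suggests the auto-gptq CUDA extension was not
# built or installed, so the loader references a module that does not exist.
model = AutoGPTQForCausalLM.from_quantized(
    "FlagAlpha/Llama2-Chinese-13b-Chat-4bit",
    device="cuda:0",
    use_safetensors=True,
)
```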
-
### System Info
I am using a Tesla T4 (16 GB)
### Reproduction
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
base_model_id = "mistralai/Mistral-7B-…
```
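Since the reproduction is cut off, here is a minimal sketch of the usual 4-bit `BitsAndBytesConfig` load that these imports point to; the model id is an assumption because the original line is truncated:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.1"  # assumed; the id in the report is cut off

# NF4 4-bit quantization with fp16 compute -- a 7B model quantized this way
# fits comfortably in a 16 GB T4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```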
-
It would be really nice to have a Functionary version of Llama 3.1 70B/8B!