-
### System Info
Ubuntu 20.04
NVIDIA A100
nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 and 24.07
TensorRT-LLM v0.14.0 and v0.11.0
### Who can help?
@Tracin
### Information
- [x] The offici…
-
Hi,
I tried both the Qwen2-VL-7B BF16 and AWQ variants, and honestly I'm not seeing any speed improvement.
The AWQ checkpoint is ~6 GB, but after loading it in vLLM it eventually ends up occupying the same amount of VRAM (~22G…
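For reference (not part of the original report): vLLM preallocates most of the GPU for the KV cache by default (`gpu_memory_utilization` defaults to 0.9), so an ~6 GB AWQ checkpoint can still show ~22 GB of VRAM in use. A minimal sketch of capping that preallocation; the checkpoint name is an assumption:

```python
# Sketch: lower vLLM's preallocation to see the weights' actual footprint.
# The checkpoint name below is an assumption, not taken from the report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.5,  # default is 0.9, which fills most of the card
)
outputs = llm.generate(
    ["Describe AWQ in one sentence."], SamplingParams(max_tokens=64)
)
print(outputs[0].outputs[0].text)
```

With the default setting, reported VRAM usage reflects the KV-cache preallocation, not the quantized weight size.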
-
### System Info
x86_64, Debian 11, L4 GPU
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supporte…
-
Hi, thanks for your work. However, I cannot find 'run_awq_llama.sh'; am I missing something?
-
I successfully quantized the mistralai/Mistral-Nemo-Instruct-2407 model to ONNX using the following command:
`python awq-quantized-model.py --model_path mistralai/Mistral-Nemo-Instruct-2407 --quant_p…
-
### Motivation
As we all know, lmdeploy runs fastest with AWQ W4A16; however, FP8 is now used in many places, so I wonder whether the developers have any plan to build an even faster W4A8-FP8 kernel in lmdepl…
-
### Describe the bug
I installed text-generation-webui and downloaded the model (TheBloke_Yarn-Mistral-7B-128k-AWQ), but I can't run it. I chose Transformers as the model loader. I tried installing autoawq b…
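As a side note (not from the issue): the plain Transformers loader cannot handle an AWQ checkpoint unless the AutoAWQ package is installed. A minimal sketch of loading the same checkpoint directly with AutoAWQ, outside the web UI:

```python
# Sketch: load the AWQ checkpoint with AutoAWQ (pip install autoawq),
# independent of text-generation-webui.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/Yarn-Mistral-7B-128k-AWQ"
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

tokens = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```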
-
Hi there,
I was struggling with how to run quantization in AutoAWQ as you mention on the home page. I was trying to quantize the 7B Qwen2-VL, but even using two A100s with 80 GB of VRAM each, I still get CUDA OOM…
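For context, the standard AutoAWQ flow from its README looks roughly like the sketch below; the paths are placeholders, and Qwen2-VL's vision tower may need model-specific handling this does not cover. During calibration, peak memory is driven by cached activations and sequence length, not only the weight size:

```python
# README-style AutoAWQ quantization sketch; paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2-VL-7B-Instruct"   # assumed source checkpoint
quant_path = "qwen2-vl-7b-instruct-awq"    # hypothetical output directory
quant_config = {"zero_point": True, "q_group_size": 128,
                "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibration runs layer by layer over a default text calibration set,
# so OOM usually points at activation caching rather than raw weights.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```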
-
### Model Series
Qwen2.5
### What are the models used?
Qwen2.5-32B-Instruct-AWQ
### What is the scenario where the problem happened?
Inference with vLLM
### Is this a known issue?
- [X] I have …
-
https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#examples
How do I set `eval_func`?
https://github.com/intel/neural-compressor/blob/master/examples/3…
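For what it's worth, `eval_func` in neural-compressor is a callable that receives the model under evaluation and returns one scalar score (higher is better); the tuner compares it against the FP32 baseline. A minimal sketch, assuming neural-compressor 2.x and a toy PyTorch model:

```python
# Sketch: eval_func returns a single scalar that the tuner maximizes.
# The toy model and stand-in metric are illustrative only.
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

fp32_model = torch.nn.Sequential(
    torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4)
)
_probe = torch.randn(8, 16)  # fixed input so the score is deterministic

def eval_func(model):
    # Replace with a real benchmark (task accuracy, negative perplexity, ...).
    with torch.no_grad():
        out = model(_probe)
    return float(out.abs().mean())  # stand-in scalar metric

conf = PostTrainingQuantConfig(approach="weight_only")  # RTN by default
q_model = quantization.fit(fp32_model, conf, eval_func=eval_func)
```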