-
Many recent papers have addressed the challenges of quantizing activations for LLMs.
Examples:
https://github.com/ziplab/QLLM?tab=readme-ov-file#%F0%9F%9B%A0-install
https://github.com/mit-h…
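For context, the recurring difficulty these projects address is that LLM activations contain large outlier channels, so a single quantization scale gets dominated by a few values. Below is a minimal sketch of plain per-token dynamic INT8 activation quantization; it is my own illustration of the failure mode, not any specific paper's method.

```python
import torch

def quantize_per_token_int8(x: torch.Tensor):
    """Symmetric per-token dynamic quantization: one scale per activation row."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# One simulated outlier channel inflates the scale and drowns out the rest,
# which is the failure mode the linked projects work around.
x = torch.randn(4, 1024)
x[:, 0] *= 100.0
q, s = quantize_per_token_int8(x)
print("mean abs error:", (dequantize(q, s) - x).abs().mean().item())
```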
-
**Describe the bug**
When using the preset W8A8 recipe from llm-compressor, the resulting model's config.json fails validation when loaded by HF Transformers. This is a dev version of Tr…
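For reference, a minimal repro sketch along the lines of the documented llm-compressor W8A8 example; the model name, dataset, and output path are placeholders, and exact import paths may differ across versions:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# W8A8 preset as in the llm-compressor examples: SmoothQuant to tame
# activation outliers, then GPTQ for INT8 weights and activations.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # placeholder calibration set
    recipe=recipe,
    output_dir="./tinyllama-w8a8",               # the failing config.json lands here
    max_seq_length=2048,
    num_calibration_samples=512,
)
```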
-
For large models (>350B parameters), the weights can't be loaded on a single node (e.g., 8 × 80 GB GPUs).
Although methods like CPU/disk offloading can overcome the limits of GPU memory, the quantization spee…
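For reference, the usual offload path is Accelerate's device_map via Transformers; a minimal sketch with a placeholder checkpoint name. Every offloaded layer round-trips over PCIe during calibration, which is exactly where the speed problem shows up:

```python
from transformers import AutoModelForCausalLM

# device_map="auto" shards across available GPUs first, then spills to CPU
# RAM, and finally to disk via offload_folder.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/350b-model",        # placeholder checkpoint
    device_map="auto",
    offload_folder="./offload",   # disk spill area for what CPU RAM can't hold
    torch_dtype="auto",
)
```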
-
Running ``quantize.py`` with ``--mode int4-gptq`` does not seem to work:
- the code tries to import ``lm-evaluation-harness``, which is not included, documented, or used
- the import in ``eval.py`` is incorrect…
-
I'm new to this specific project, so I don't say any of the following with high confidence.
Things that I see as important for quantization:
*Inference speed*
- AWQ seems best on this front, t…
-
Hi. Is there any support for converting the YOLOv8-seg model to INT8 precision and using it with DeepStream?
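For reference, one common route is exporting to a TensorRT INT8 engine that DeepStream then loads directly. A sketch using the Ultralytics export API, with the weights file and calibration dataset as assumptions:

```python
from ultralytics import YOLO

# Export the segmentation model to a TensorRT engine with INT8 calibration;
# DeepStream can consume the resulting .engine file.
model = YOLO("yolov8n-seg.pt")          # assumed weights file
model.export(
    format="engine",                    # TensorRT engine output
    int8=True,                          # enable INT8 calibration
    data="coco128-seg.yaml",            # assumed calibration dataset yaml
)
```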
-
Hi all,
We've recently open-sourced VPTQ (Vector Post-Training Quantization), a novel post-training quantization method that leverages vector quantization to achieve hi…
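To make the idea concrete, here is a minimal sketch of the underlying vector-quantization step: group weights into short vectors, learn a codebook with plain k-means, and store only centroid indices. This illustrates the concept, not the actual VPTQ algorithm:

```python
import torch

def vq_quantize(weight: torch.Tensor, vec_dim: int = 4,
                codebook_size: int = 256, iters: int = 10):
    """Quantize a 2-D weight matrix to (codebook, indices) via plain k-means."""
    flat = weight.reshape(-1, vec_dim)                  # group weights into vectors
    init = torch.randperm(flat.shape[0])[:codebook_size]
    codebook = flat[init].clone()                       # random-row initialization
    for _ in range(iters):
        assign = torch.cdist(flat, codebook).argmin(dim=1)  # nearest centroid
        for k in range(codebook_size):
            members = flat[assign == k]
            if members.numel():
                codebook[k] = members.mean(dim=0)       # k-means update
    return codebook, assign

def vq_dequantize(codebook, assign, shape):
    return codebook[assign].reshape(shape)              # look up centroids

w = torch.randn(256, 256)
cb, idx = vq_quantize(w)
w_hat = vq_dequantize(cb, idx, w.shape)
print("reconstruction MSE:", torch.mean((w - w_hat) ** 2).item())
```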
-
### System Info
torch 2.5.1+cu121
diffusers 0.31.0
torchao 0.7.0+cpu
Python 3.11.10
Windows 11
### Information
- [X] The official example scr…
-
### Your current environment
"""
This example shows how to use LoRA with different quantization techniques
for offline inference.
Requires HuggingFace credentials for access.
"""
import gc
…
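Since the script is cut off above, here is a minimal sketch of the same pattern with the vLLM offline API; the checkpoint, adapter name, and path are placeholders, not from the original example:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load a pre-quantized AWQ checkpoint with LoRA support enabled.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # placeholder quantized checkpoint
    quantization="awq",
    enable_lora=True,
    max_lora_rank=16,
)

prompts = ["Explain activation quantization in one sentence."]
params = SamplingParams(temperature=0.0, max_tokens=64)

# Attach a LoRA adapter per request; name, id, and path are assumptions.
outputs = llm.generate(
    prompts,
    params,
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_adapter"),
)
for out in outputs:
    print(out.outputs[0].text)
```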