-
Similar to #1252, do we have any plans for supporting V100? For now I can see that the places that need to be modified are the ldmatrix instruction and m16n8k16; as an example, we may need to load the matrix man…
-
### Feature request
There is too much boilerplate; a template that resolves loading, quantization, and device would help.
E.g.:
device: auto -> torch.cuda.is_available() -> cuda or mps
dtype: float32 -> float32, no q…
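A minimal sketch of what such a resolver could look like, assuming `torch`; the helper names and the exact fallback order (cuda, then mps, then cpu) are my own, not an existing API:

```python
import torch


def resolve_device(device: str = "auto") -> torch.device:
    """Hypothetical helper: map an 'auto' setting to a concrete device."""
    if device != "auto":
        return torch.device(device)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def resolve_dtype(dtype: str = "float32") -> torch.dtype:
    """Hypothetical helper: 'float32' means full precision, i.e. no quantization."""
    return {"float32": torch.float32,
            "float16": torch.float16,
            "bfloat16": torch.bfloat16}[dtype]


print(resolve_device("auto"), resolve_dtype("float32"))
```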
-
### Is there an existing issue for this problem?
- [X] I have searched the existing issues
### Operating system
Windows
### GPU vendor
Nvidia (CUDA)
### GPU model
RTX 4090
### GPU VRAM
24 GB
#…
-
There are several experiments being done with this repo to understand and evaluate the effects of quantization on the `llama2.c` models.
It is a great test-bed to analyze the effects of varying app…
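As a flavor of the kind of measurement such experiments involve, here is a minimal, generic sketch of symmetric per-tensor int8 weight quantization and its reconstruction error; it is not the repo's actual quantization scheme, just an illustration of the idea:

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


# Toy experiment: how much error does int8 introduce on a random weight matrix?
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale
print(f"max abs error: {np.abs(w - w_hat).max():.5f} (scale={scale:.5f})")
```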
-
### 💡 Your Question
I have followed exactly the same steps for model training followed by PTQ and QAT as described in the official super-gradients notebook:
https://github.com/Deci-AI/super-gradients/blob…
-
### The quantization format
Hi all,
We have recently designed and open-sourced a new method for Vector Quantization called Vector Post-Training Quantization (VPTQ). Our work is available at [VPTQ…
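For readers new to the idea, here is a minimal, generic sketch of vector quantization of a weight matrix (a small k-means codebook plus per-vector indices); it illustrates only the basic concept, not the VPTQ algorithm itself:

```python
import numpy as np


def vector_quantize(w: np.ndarray, vec_len: int = 4, k: int = 256, iters: int = 20):
    """Group weight entries into length-`vec_len` vectors and replace each with
    the nearest centroid from a `k`-entry codebook (plain k-means)."""
    vecs = w.reshape(-1, vec_len)
    rng = np.random.default_rng(0)
    codebook = vecs[rng.choice(len(vecs), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distances of every vector to every centroid.
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = vecs[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    return codebook, assign  # storage: k * vec_len floats + one index per vector


w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = vector_quantize(w)
w_hat = codebook[idx].reshape(w.shape)
print("reconstruction MSE:", float(((w - w_hat) ** 2).mean()))
```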
-
### SDK
Python
### Description
- From https://huggingface.co/blog/embedding-quantization: _Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval_
- Also from https…
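A minimal NumPy sketch of the two schemes the blog post describes, binary and int8 (scalar) quantization of embeddings; the calibration-range choice below is an assumption made purely for illustration:

```python
import numpy as np


def binary_quantize(emb: np.ndarray) -> np.ndarray:
    """Binary quantization: keep only the sign of each dimension, bit-packed
    (32x smaller than float32); retrieval then uses Hamming distance."""
    return np.packbits(emb > 0, axis=-1)


def int8_quantize(emb: np.ndarray, calib: np.ndarray):
    """Scalar (int8) quantization: map each dimension's calibration range onto [-128, 127]."""
    lo = calib.min(axis=0)
    scale = (calib.max(axis=0) - lo) / 255.0
    q = np.clip(np.round((emb - lo) / scale) - 128, -128, 127).astype(np.int8)
    return q, lo, scale


emb = np.random.randn(1000, 384).astype(np.float32)   # stand-in for real embeddings
packed = binary_quantize(emb)                          # shape (1000, 48), dtype uint8
q, lo, scale = int8_quantize(emb, calib=emb)           # corpus doubles as calibration set here
```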
-
### 🚀 The feature, motivation and pitch
I am trying to implement an eager mode for PT2E quantization on CPU. Currently, PT2E quantization on CPU is lowered to Inductor via `torch.compile`. The current…
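For context, here is a sketch of the current non-eager PT2E flow on CPU as I understand it; the capture and quantizer entry points have moved between PyTorch releases, so treat the exact names below as approximate:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

# Capture the model into a graph (this API has moved between releases).
exported = torch.export.export_for_training(model, example_inputs).module()

quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)            # calibration pass
quantized = convert_pt2e(prepared)

# Today the quantized graph is lowered to Inductor via torch.compile;
# the request here is for a way to run it eagerly instead.
compiled = torch.compile(quantized)
print(compiled(*example_inputs).shape)
```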
-
We currently only support 4-bit quantization via BitsAndBytes. We should support other options such as 8-bit and (potentially) 6-bit.
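For reference, a sketch of what 4-bit vs. 8-bit BitsAndBytes loading looks like on the Hugging Face Transformers side; how this would map onto this project's loader is exactly the open question, and the model name is purely illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (supported today) vs. 8-bit (requested) BitsAndBytes configurations.
bnb_4bit = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",              # small model, purely illustrative
    quantization_config=bnb_8bit,
    device_map="auto",
)
```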
-
**Describe the bug**
When using the preset W8A8 recipe from llm-compressor, the resulting model's config.json fails validation when loaded by HF Transformers. This is a dev version of Tr…