-
Hi there,
I was struggling with how to implement quantization with AutoAWQ, as mentioned on the home page. I was trying to quantize the 7B Qwen2-VL model, but even with 2 A100 80GB GPUs I still get CUDA OOM…
-
We need a separate product quantization API that is decoupled from IVF but can still be composed into IVF.
Ideally this API would follow FAISS or scikit-learn's transformer/estimator conventions.
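As an illustration, a scikit-learn-style estimator for product quantization could look roughly like the sketch below. The `ProductQuantizer` class, its parameters, and its methods are hypothetical, not an existing FAISS or scikit-learn API:

```python
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    """Hypothetical estimator-style PQ sketch: split each vector into
    `m` sub-vectors, learn a k-means codebook per sub-space, and emit
    one uint8 code per sub-space in transform()."""

    def __init__(self, m=4, k=256):
        self.m = m                # number of sub-quantizers
        self.k = k                # codebook size per sub-space
        self.codebooks_ = None    # fitted KMeans per sub-space

    def fit(self, X):
        d = X.shape[1]
        assert d % self.m == 0, "dimension must be divisible by m"
        sub = d // self.m
        self.codebooks_ = []
        for i in range(self.m):
            km = KMeans(n_clusters=self.k, n_init=1, random_state=0)
            km.fit(X[:, i * sub:(i + 1) * sub])
            self.codebooks_.append(km)
        return self

    def transform(self, X):
        sub = X.shape[1] // self.m
        codes = np.empty((X.shape[0], self.m), dtype=np.uint8)
        for i, km in enumerate(self.codebooks_):
            codes[:, i] = km.predict(X[:, i * sub:(i + 1) * sub])
        return codes
```

Because it exposes only `fit`/`transform`, such a quantizer could be trained standalone or plugged into an IVF index that calls it on residuals, which is the decoupling the request is about.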
-
Similar to affine quantization, we can implement codebook- or lookup-table-based quantization, which is another popular type of quantization, especially at lower bit widths like 4 bits or below (used in ht…
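To make the idea concrete, here is a tiny 1-D codebook quantizer sketch (the helper names are illustrative, not any library's API): weights are clustered into 2^bits centroids with a small k-means loop, then stored as per-weight indices into the shared lookup table.

```python
import numpy as np

def lut_quantize(w, bits=4, iters=10):
    """Codebook (lookup-table) quantization sketch for a 1-D weight
    array: learn 2**bits centroids, return (indices, codebook)."""
    k = 2 ** bits
    # Initialize centroids on evenly spaced quantiles of the weights,
    # which adapts the codebook to the weight distribution.
    centroids = np.quantile(w, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = w[idx == j].mean()
    idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), centroids

def lut_dequantize(idx, centroids):
    """Reconstruct weights by indexing into the codebook."""
    return centroids[idx]
```

Unlike affine quantization, the reconstruction levels need not be uniformly spaced, which is why codebook methods tend to hold up better at 4 bits and below.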
-
Hi,
Have you tried quantizing Mamba? Do you plan on releasing quantized versions?
Can you share your thoughts on quantizing Mamba, given the sensitivity of the model's recurrent dynamics?
Thanks
-
As there are already a few models with Half-Quadratic Quantization (HQQ) out there, vLLM should also support them:
```sh
api_server.py: error: argument --quantization/-q: invalid choice: 'hqq' (choose from …
```
-
### SDK
Python
### Description
- From https://huggingface.co/blog/embedding-quantization: _Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval_
- Also from https…
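A rough sketch of the binary-quantization idea from that blog post (the function names are illustrative, not the sentence-transformers API): keep only the sign of each embedding dimension, pack 8 dimensions per byte for a 32x memory reduction over float32, and rank by Hamming distance.

```python
import numpy as np

def binary_quantize(embeddings):
    """Binary embedding quantization sketch: keep the sign bit of
    each dimension and pack 8 dimensions into one byte."""
    signs = (embeddings > 0).astype(np.uint8)
    return np.packbits(signs, axis=-1)

def hamming_scores(query_bits, doc_bits):
    """Rank packed codes by Hamming distance (lower = more similar).
    Popcount is done via unpackbits for clarity, not speed."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)
```

Usage: quantize the corpus once, quantize each query at search time, and take the documents with the smallest Hamming distance; the blog describes an optional float-rescoring pass on that shortlist.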
-
### Feature request
There is too much boilerplate; a template that resolves loading, quantization, and device would help. E.g.:
device: auto -> torch.cuda.is_available() -> cuda or mps.
dtype: float32 -> float32, no q…
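The `device: auto` resolution described above could be sketched as follows (the helper name is hypothetical, not an existing API):

```python
import torch

def resolve_device(device="auto"):
    """Sketch of 'device: auto' resolution: prefer CUDA, then MPS,
    then fall back to CPU; pass through any explicit device string."""
    if device != "auto":
        return torch.device(device)
    if torch.cuda.is_available():
        return torch.device("cuda")
    # MPS backend only exists on recent PyTorch builds, so probe defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```

The dtype side would be resolved analogously, mapping a config string to a `torch.dtype` plus an optional quantization config.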
-
https://github.com/intel/neural-compressor/tree/master/examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only
```sh
bash run_quant.sh --input_model=./Meta-Llama-3.1-8B -…
```
-
Dear author, when I reproduce the w4a4 quantization of Vicuna-7B-v1.5 on a single A800 using the default parameters in run.sh, I get:
```
***** 0-shot *****
***** MMLU_eval subcategories metrics …
```
-
### The quantization format
Hi all,
We have recently designed and open-sourced a new method for Vector Quantization called Vector Post-Training Quantization (VPTQ). Our work is available at [VPTQ…