-
Thanks for the excellent work!
I used `examples/basic_quant_mix.py` to quantize the Qwen2-7B model with `--w_bit 8`. Strangely, the quantized model is even larger than the original model.
…
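A minimal diagnostic sketch for this kind of size regression, assuming the quantized checkpoint was saved as safetensors (the directory name below is a placeholder): if the "quantized" weights print as fp16/fp32 rather than int8, the script saved dequantized tensors, which would explain the growth.
```py
import os
from safetensors import safe_open

model_dir = "qwen2-7b-w8"  # placeholder for the quantized output directory
total_bytes = 0
for fname in sorted(os.listdir(model_dir)):
    if not fname.endswith(".safetensors"):
        continue
    path = os.path.join(model_dir, fname)
    total_bytes += os.path.getsize(path)
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            tensor = f.get_tensor(key)
            # int8 weights should show torch.int8 here; fp16/fp32 means
            # the checkpoint holds dequantized tensors.
            print(key, tensor.dtype, tuple(tensor.shape))
print(f"total checkpoint size: {total_bytes / 2**30:.2f} GiB")
```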
-
**Describe the bug**
When using the preset W8A8 recipe from llm-compressor, the resulting model's `config.json` fails validation when loaded by HF Transformers. This is a dev version of Tr…
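A minimal way to reproduce the validation failure in isolation, assuming the quantized model was saved to a local directory (the path is a placeholder); loading only the config avoids pulling in the weights:
```py
from transformers import AutoConfig

# Parsing just config.json is enough to trigger the validation error
# without loading the quantized weights into memory.
config = AutoConfig.from_pretrained("./w8a8-output")  # placeholder path
print(getattr(config, "quantization_config", None))
```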
-
I can train the ViT model from Hugging Face Transformers,
but when converting it to a TFLite model an error appears that I can't resolve.
The following are the `tinynn` settings and the error…
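For reference, a conversion sketch using `tinynn.converter.TFLiteConverter`, with `google/vit-base-patch16-224` as a stand-in checkpoint; the wrapper is a common workaround for tracing HF models, not necessarily the reporter's setup:
```py
import torch
from transformers import ViTForImageClassification
from tinynn.converter import TFLiteConverter

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model.eval()

class Wrapper(torch.nn.Module):
    """HF models return ModelOutput objects; returning a plain tensor
    avoids tracing problems in converters that expect tensor outputs."""
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, x):
        return self.m(pixel_values=x).logits

dummy_input = torch.randn(1, 3, 224, 224)
converter = TFLiteConverter(Wrapper(model), dummy_input, tflite_path="vit.tflite")
converter.convert()
```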
-
I found a [similar closed issue](https://github.com/microsoft/VPTQ/issues/56) related to this topic. Following your reply in that issue, I successfully configured the `vptq-algo` environment based on …
-
I'm currently using an H800 to run smooth quantization on my custom Flux transformer. I'm wondering how long it should take to finish. It has been running for 20 minutes, but the progress …
-
### System Info
SERVER: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
PRETTY_NAME: "Debian GNU/Linux 11 (bullseye)"
python: 3.11.5
conda: 23.10.0
torch: 2.4.1+cpu
### Running Xinference with D…
-
Quantization on GPU works as expected with very small errors, but on CPU there seems to be a problem with the quantized model's output. Here is the code to replicate the problem.
```py
import torc…
-
### Add Hardware Compatibility Check for FP8 Quantization
#### Issue Summary
In our current implementation, we provide three APIs for model computation in FP8 format. However, for dynamic activati…
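A sketch of what such a check could look like, assuming PyTorch CUDA devices; FP8 (E4M3/E5M2) tensor-core support starts at compute capability 8.9 (Ada) and 9.0 (Hopper), so older GPUs should fall back to another dtype:
```py
import torch

def fp8_supported() -> bool:
    """Best-effort check that the visible GPU can run FP8 kernels."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # Ada (8.9) and Hopper (9.0) are the first architectures with
    # hardware FP8 support.
    return (major, minor) >= (8, 9)

if not fp8_supported():
    raise RuntimeError("FP8 quantization requested on hardware without FP8 support")
```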
-
Hi,
I ran the hello world example quantization script and it seems to increase the model size. This does not occur with Pete Warden's original notebook. He uses TensorFlow 2.0.0. Using the 2.18.0 in…
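For comparison, a minimal post-training dynamic-range quantization sketch with a stand-in model (the tutorial's trained sine model would go in its place); without the `optimizations` line the converter emits a float32 model, which can easily be larger than the original file:
```py
import tensorflow as tf

# Stand-in for the tutorial's trained "hello world" sine model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Enables dynamic-range quantization, so weights are stored as int8
# instead of float32; this is what should shrink the file.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("hello_world_quant.tflite", "wb") as f:
    f.write(tflite_model)
print(f"quantized size: {len(tflite_model)} bytes")
```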
-
Is the full model needed before applying quantization? It would be nice if it weren't, but maybe that's hard to avoid.
At the moment the full model is downloaded while the pipeline loads, even tho…
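For context, a sketch of quantize-on-load with Transformers and bitsandbytes (the checkpoint name is just an example, not from this issue): the full-precision shards are still downloaded, but each weight is quantized as it streams in, so peak memory, unlike the download, stays near the quantized size.
```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# The fp16 shards are still fetched in full, but each tensor is
# converted to 8-bit as it is loaded rather than after the fact.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",  # example checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```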