-
### Describe the issue
Reduce-range does not improve the metric
### To reproduce
I'm using the reduce-range feature. Quantization is calculated symmetrically, in QDQ format, for int8.
But…
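For context, a minimal self-contained sketch of the setup described above, using `quantize_static` from `onnxruntime.quantization` (paths, the input name, and the calibration data are placeholders):

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class _Reader(CalibrationDataReader):
    """Placeholder reader; real code should feed representative inputs."""
    def __init__(self):
        self._it = iter(
            {"input": np.zeros((1, 3, 224, 224), np.float32)}  # assumed input name/shape
            for _ in range(8)
        )

    def get_next(self):
        return next(self._it, None)  # None signals end of calibration data

quantize_static(
    "model.onnx",                     # placeholder input path
    "model_int8.onnx",                # placeholder output path
    calibration_data_reader=_Reader(),
    quant_format=QuantFormat.QDQ,     # QDQ format, as in the report
    activation_type=QuantType.QInt8,  # symmetric int8
    weight_type=QuantType.QInt8,
    reduce_range=True,                # the reduce-range feature in question
)
```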
-
Multi-vector MaxSim is increasingly important and we have optimizations for float cell precision, but I think we should also consider optimizing for int8 with hamming, as it approximates the dot product f…
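To make the idea concrete, here is a small NumPy sketch (my own illustration, not code from this project) of MaxSim over bit-packed multi-vectors, using `bits - hamming` as a proxy for the dot product:

```python
import numpy as np

def hamming_maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """MaxSim over bit-packed uint8 vectors of shape (n_vectors, n_bytes):
    for each query vector, take the most similar document vector and sum."""
    xor = query_vecs[:, None, :] ^ doc_vecs[None, :, :]   # differing bits per pair
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1)       # Hamming distance matrix
    sim = query_vecs.shape[1] * 8 - dist                  # dot-product proxy
    return float(sim.max(axis=1).sum())                   # MaxSim reduction
```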
-
I am attempting to emit PyTorch code, but unfortunately it does not work for fp8, bf16, or int8. I have tried to patch the converter type dict: https://github.com/OrenLeung/cutlass/commit/6d619c964eb8b…
-
Hi, can you share best practices for quantizing CNN models?
Is ModelOpt PTQ the way to go with TensorRT for CNN models (ResNet, RetinaNet, etc.)? I was able to quantize RetinaNet…
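Not authoritative, but the PTQ flow I'd expect with NVIDIA ModelOpt looks roughly like this (the config choice and the dummy calibration data are assumptions on my side; real calibration should iterate a validation loader):

```python
import torch
import torchvision
import modelopt.torch.quantization as mtq

model = torchvision.models.retinanet_resnet50_fpn(weights="DEFAULT").eval()

def forward_loop(m):
    # calibrate on a few representative batches (dummy inputs here for brevity)
    for _ in range(8):
        m([torch.rand(3, 480, 640)])

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
# then export to ONNX and build the TensorRT engine as usual
```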
-
### 🚀 The feature, motivation and pitch
**Feature motivation:**
[Default PyTorch quantization-aware training](https://pytorch.org/docs/stable/quantization.html) uses a "fake-quantization" approach. Fo…
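For readers unfamiliar with the term, a minimal hand-rolled sketch of what fake quantization does in the forward pass (an illustration, not PyTorch's actual observer machinery):

```python
import torch

class FakeQuant(torch.nn.Module):
    """Simulate int8 quantization in float: round/clamp to the int8 grid,
    dequantize, and pass gradients straight through."""
    def __init__(self, scale: float = 0.1, qmin: int = -128, qmax: int = 127):
        super().__init__()
        self.scale, self.qmin, self.qmax = scale, qmin, qmax

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(torch.round(x / self.scale), self.qmin, self.qmax)
        dq = q * self.scale
        # straight-through estimator: forward uses dq, backward sees identity
        return x + (dq - x).detach()
```

During real QAT the scale would be learned or derived from observed ranges; the fixed scale here is only for illustration.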
-
### Feature request
Currently, the Qwen series cannot use int4/int8 quantization under the vLLM and SGLang engines.
### Motivation
Currently, the Qwen series cannot use int4/int8 quantization under the vLLM and SGLang engines.
### Your contribution
None
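For reference, the call pattern this request implies, assuming vLLM's standard `quantization` argument (a sketch; the point of the issue is that this does not currently work for these models):

```python
from vllm import LLM, SamplingParams

# Hypothetical invocation once int4 GPTQ support lands for the Qwen series:
llm = LLM(model="Qwen/Qwen2-7B-Instruct-GPTQ-Int4", quantization="gptq")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```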
-
hello!
I built the int8 weights with:
```
INFERENCE_PRECISION=float16
WEIGHT_ONLY_PRECISION=int8
MAX_BEAM_WIDTH=4
MAX_BATCH_SIZE=8
checkpoint_dir=whisper_large_v3_weights_${WEIGHT_ONLY_PRECISION}
output_dir…
```
-
### Describe the issue
Hello,
I'm trying to quantize an ONNX model to INT8 using the ONNX Runtime tools provided [here](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/…
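In case it helps while debugging, dynamic (weight-only) quantization from the same toolkit needs no calibration data and makes a quick baseline to compare against static INT8 (a sketch; paths are placeholders):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are quantized offline; activations are quantized at runtime,
# so no CalibrationDataReader is required for this path.
quantize_dynamic("model.onnx", "model_int8_dyn.onnx", weight_type=QuantType.QInt8)
```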
-
Qwen has released some quantized models:
Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4
Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4
Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8
Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8
since t…
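For what it's worth, loading one of these checkpoints with recent `transformers` should be the usual call, since the GPTQ config ships inside the repo (a sketch, assuming a GPTQ backend such as auto-gptq is installed):

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)
```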