-
I have a question. Can the Vitis AI quantizer be used with formats other than INT8 on the **ZCU104**? Also, after quantization, is the computation performed using INT8 or is it just stored as INT8? If…
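The storage-vs-compute distinction in the question can be illustrated with a minimal numpy sketch (generic affine int8 quantization, not Vitis-AI-specific): "fake quant" stores int8 codes but computes in float32 after dequantizing, while true int8 compute runs the arithmetic on the codes themselves with an int32 accumulator and a single rescale at the end.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Affine quantization: map float32 values to int8 codes."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 codes back to (approximate) float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.5, -1.2, 3.1], dtype=np.float32)
scale, zp = 0.05, 0

q = quantize(x, scale, zp)        # stored as int8
x_hat = dequantize(q, scale, zp)  # "fake quant": compute would continue in float32

# True int8 compute instead operates on the codes directly:
# an int8 dot product accumulated in int32, rescaled once at the end.
w = np.array([1.0, 2.0, -0.5], dtype=np.float32)
qw = quantize(w, scale, zp)
acc = np.dot(q.astype(np.int32), qw.astype(np.int32))  # int32 accumulator
y = acc * scale * scale                                # single rescale to float
```

Whether the deployed accelerator takes the first path or the second is a property of its kernels, not of the quantized file format.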
-
Hi @tridao, we recently implemented INT8 forward FMHA (8-bit Flash-Attention) with both static and dynamic quantization for Softmax on our GPGPU card, and achieved good results and relatively okay acc…
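The static/dynamic distinction mentioned above can be sketched for Softmax outputs (an illustrative numpy toy, not the poster's FMHA kernel): a static scale is fixed ahead of time from the known [0, 1] output range, while a dynamic scale is computed from the runtime maximum and uses the integer range more fully when probabilities are concentrated well below 1.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1, -3.0]], dtype=np.float32)
p = softmax(logits)

# Static: scale fixed ahead of time. Softmax outputs lie in [0, 1],
# so a fixed uint8 scale of 1/255 covers the whole possible range.
static_scale = 1.0 / 255.0
q_static = np.round(p / static_scale).astype(np.uint8)

# Dynamic: scale derived from the runtime max, so the largest observed
# probability always maps to code 255 and small values keep more precision.
dyn_scale = p.max() / 255.0
q_dyn = np.round(p / dyn_scale).astype(np.uint8)

err_static = np.abs(q_static * static_scale - p).max()
err_dyn = np.abs(q_dyn * dyn_scale - p).max()
```

The trade-off is the usual one: dynamic scales cost a runtime max-reduction per tensor, static scales cost calibration effort up front.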
-
When converting the model, I enabled quantization to 'int8', but I noticed that the converted model's BLEU score dropped by 5 points.
Therefore, I would like to inquire if the…
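Accuracy drops of this size often trace back to a per-tensor scale being dominated by a few outlier channels. A minimal numpy sketch (illustrative only, not specific to this converter) comparing per-tensor and per-channel weight quantization error:

```python
import numpy as np

rng = np.random.default_rng(0)
# Weight matrix whose rows (output channels) have very different ranges —
# a common cause of accuracy loss under per-tensor int8 quantization.
W = rng.normal(size=(4, 64)).astype(np.float32)
W[0] *= 10.0  # one channel dominates the tensor-wide max

def quant_dequant(w, scale):
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Per-tensor: one scale for the whole matrix.
s_tensor = np.abs(W).max() / 127.0
err_tensor = np.abs(quant_dequant(W, s_tensor) - W).mean()

# Per-channel: one scale per output row, so small-range channels
# keep far more of their precision.
s_chan = np.abs(W).max(axis=1, keepdims=True) / 127.0
err_chan = np.abs(quant_dequant(W, s_chan) - W).mean()
```

If the toolchain exposes a per-channel (or per-axis) weight-quantization option, it is usually the first knob to try before moving to mixed precision.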
-
### Describe the issue
After quantization, the output ONNX model had faster inference speed and smaller model size, but why are the input and output tensors still float32?
I thought it should be u…
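A plausible explanation, sketched below in plain numpy (illustrative, not onnxruntime code): quantization exporters commonly wrap the int8 core with QuantizeLinear at the graph entry and DequantizeLinear at the graph exit, so callers keep passing and receiving float32 tensors even though the internal ops run on int8 codes.

```python
import numpy as np

def quantize_linear(x, scale, zp):    # float32 -> int8 at graph entry
    return np.clip(np.round(x / scale) + zp, -128, 127).astype(np.int8)

def dequantize_linear(q, scale, zp):  # int8 -> float32 at graph exit
    return (q.astype(np.float32) - zp) * scale

def int8_core(q_x, q_w):              # internal ops run on int8 codes
    return np.matmul(q_x.astype(np.int32), q_w.astype(np.int32))

x = np.array([[0.5, -0.25]], dtype=np.float32)  # model input: float32
w = np.array([[1.0], [2.0]], dtype=np.float32)
sx, sw = 0.01, 0.02

q_y = int8_core(quantize_linear(x, sx, 0), quantize_linear(w, sw, 0))
y = dequantize_linear(q_y, sx * sw, 0)          # model output: float32
```

Keeping the interface float32 preserves compatibility with existing pre/post-processing; some toolchains offer an option to push the quantize/dequantize steps outside the graph if true int8 I/O is wanted.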
-
### Description
With int8 & int4 and any further quantization schemes we provide, it is possible that, to achieve adequate recall, some oversampling & rescoring with the raw float32 vectors might…
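The oversample-then-rescore pattern can be sketched as follows (a generic numpy toy, with dataset size, `k`, and the oversampling factor chosen arbitrarily): score cheaply with the quantized vectors, keep `k * oversample` candidates, then rescore only those with the raw float32 vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(1000, 32)).astype(np.float32)
query = rng.normal(size=32).astype(np.float32)

# int8-quantized copies used for the cheap first-pass search.
scale = np.abs(docs).max() / 127.0
docs_q = np.round(docs / scale).astype(np.int8)
query_q = np.round(query / scale).astype(np.int8)

k, oversample = 10, 4

# 1) First pass: score with int8 vectors, keep k * oversample candidates.
coarse = docs_q.astype(np.int32) @ query_q.astype(np.int32)
cand = np.argsort(-coarse)[: k * oversample]

# 2) Rescore only the candidates with the raw float32 vectors, keep top k.
fine = docs[cand] @ query
topk = cand[np.argsort(-fine)[:k]]
```

The float32 work scales with `k * oversample` rather than the corpus size, which is what makes the rescoring pass affordable.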
-
### Describe the issue
It appears that when processing a standalone QuantizeLinear node in onnxruntime, the rounding behavior consistently rounds to the lower integer instead of the expected …
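For reference, the ONNX QuantizeLinear specification prescribes round-to-nearest with ties to even ("banker's rounding"), which numpy's `rint` implements; the two behaviors can be compared directly:

```python
import numpy as np

x = np.array([2.5, 3.5, -2.5, 1.4999], dtype=np.float32)

# Round-half-to-even, as the ONNX QuantizeLinear spec prescribes for ties:
# 2.5 -> 2, 3.5 -> 4, -2.5 -> -2.
ties_to_even = np.rint(x).astype(np.int8)

# Rounding toward the lower integer instead biases every value downward:
floored = np.floor(x).astype(np.int8)
```

If the observed outputs match the `floored` row, that would indeed be a deviation from the spec rather than an expected tie-breaking artifact.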
-
How much GPU memory is needed to quantize flux-dev?
Can it be offloaded to the CPU when there is not enough GPU memory?
The following part of your input was truncated because CLIP can only handle sequences up to 77…
-
**Question**:
I have an encoder decoder model, quantized using TensorRT's packages for post-training quantization. It is in the HuggingFace transformers saved model format. The model is a TrOCR model…
-
### System information
- OS: openSUSE Tumbleweed (Linux)
- TensorFlow installation: pip
- TensorFlow version: tf-nightly (also occurs on earlier versions)
### Code
Converting a model containing an …
-
Was chatting with @Chillee about our plans in AO today and he mentioned we should be focusing on a few concrete problems like
1. Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.
…