-
## Environment
- RTX8000 GPU
- TensorRT-LLM v0.9.0
## Model
- LLaVA v1.5 7B (LLaMA2 7B)
- fp16 and int8/int4 weight quantization
- batch size = 16
## Script
- official `examples/multimodal/run.…
-
Hi,
Thanks for the great work!
Has your team tried QAT/PTQ int8 quantization on the star operations? After all, networks are usually quantized before being deployed in production.
Thanks for…
-
Dear Developers,
I am very new to TensorRT and quantization. Previously I have only used the basic TensorRT example to generate engines in FP16, because I thought INT8 would compromise accuracy signific…
-
### Feature request
Hi, I've created a 4-bit quantized model using `BitsAndBytesConfig`, for example
```
from transformers import AutoModelForTokenClassification, BitsAndBytesConfig
from optim…
```
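For reference, a minimal self-contained sketch of creating such a model (the checkpoint name and the specific 4-bit settings are illustrative assumptions, not taken from the truncated snippet above):

```
import torch
from transformers import AutoModelForTokenClassification, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmuls
)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",                      # placeholder checkpoint
    quantization_config=bnb_config,
)
```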
-
### Describe the issue
After quantization, the output ONNX model has faster inference and a smaller model size, but why are the input and output tensors still float32?
I thought it should be u…
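For context, a small sketch (the file names and the use of `quantize_dynamic` are assumptions, since the original export setup is not shown) that checks the I/O dtypes of a quantized graph; ONNX Runtime quantization keeps the graph's float32 inputs and outputs by default and inserts the integer ops inside the graph:

```
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights to int8 (file names are placeholders).
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)

# The graph inputs/outputs are still declared as FLOAT; the int8 math happens
# internally via the inserted quantize/dequantize nodes.
quantized = onnx.load("model.int8.onnx")
for tensor in list(quantized.graph.input) + list(quantized.graph.output):
    print(tensor.name, onnx.TensorProto.DataType.Name(tensor.type.tensor_type.elem_type))
```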
-
Model: Llama-2-7b-chat
CT2 version: 3.19.0
I found that when I use int8* quantization, the inference speed depends drastically on num_hypotheses.
I tried to benchmark my model with a batch …
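For context, a rough shape of such a benchmark (the converted model path, prompt, batch size, and decoding settings below are illustrative assumptions, not the exact setup from this report):

```
import time
import ctranslate2
import transformers

generator = ctranslate2.Generator("llama-2-7b-chat-ct2", device="cuda", compute_type="int8")
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, how are you?"))

for num_hypotheses in (1, 2, 4, 8):
    start = time.perf_counter()
    generator.generate_batch(
        [prompt_tokens] * 8,          # batch of identical prompts
        beam_size=num_hypotheses,     # num_hypotheses must not exceed beam_size
        num_hypotheses=num_hypotheses,
        max_length=128,
    )
    print(f"num_hypotheses={num_hypotheses}: {time.perf_counter() - start:.2f}s")
```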
-
## Description
Hi, I notice from [Issue](https://github.com/NVIDIA/TensorRT/issues/3243#issuecomment-1714849183) that the int8 MHA_v2 kernel only supports SeqLen > 512. I use pytorch_quantization t…
-
I am trying dynamic quantization for the Hugging Face T5-small model on Graviton3. I have used
```
torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
```
In…
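For reference, a runnable sketch of that call (loading "t5-small" with `T5ForConditionalGeneration` is an assumption about the setup; only the `quantize_dynamic` line is from the report above):

```
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Dynamically quantize every nn.Linear to int8 weights; activations remain
# float32 and are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8
)
```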
-
### Description
With int8 & int4, and any further quantization schemes we will provide, it is possible that, to achieve adequate recall, some oversampling & rescoring with the raw float32 vectors might…
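As an illustration of the idea, a small sketch (the scoring scheme, scalar scale, and parameters are assumptions for illustration, not the project's actual implementation) of oversampling candidates with the quantized vectors and then rescoring them with the raw float32 vectors:

```
import numpy as np

def search_with_rescoring(query, int8_vectors, scale, float32_vectors, k=10, oversample=4):
    # First pass: cheap approximate scores against the int8-quantized vectors.
    approx_scores = (int8_vectors.astype(np.float32) * scale) @ query
    candidates = np.argsort(-approx_scores)[: k * oversample]
    # Second pass: rescore only the oversampled candidates with the raw float32
    # vectors, then keep the exact top-k.
    exact_scores = float32_vectors[candidates] @ query
    return candidates[np.argsort(-exact_scores)[:k]]
```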
-
Using FBGEMM to support CPU quantization.
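A minimal sketch of what that looks like in PyTorch (the toy model and the dynamic-quantization call are illustrative assumptions):

```
import torch

# FBGEMM is PyTorch's quantized backend for x86 server CPUs
# ("qnnpack" is the ARM alternative).
torch.backends.quantized.engine = "fbgemm"

model = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()  # toy placeholder model
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```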