-
## Environment
- RTX8000 GPU
- TensorRT-LLM v0.9.0
## Model
- LLaVA v1.5 7B (LLaMA2 7B)
- fp16 and int8/int4 weight quantization
- batch size = 16
## Script
- official `examples/multimodal/run.…
-
Hi,
Thanks for the great work!
Has your team tried QAT/PTQ int8 quantization on the star operations? After all, networks are usually quantized before being deployed in production.
Thanks for…
-
Dear Developers,
I am very new to TensorRT and quantization. Previously I have only used the basic TensorRT example to generate engines in FP16, because I thought INT8 would compromise accuracy signific…
-
### Feature request
Hi, I've created a 4-bit quantized model using `BitsAndBytesConfig`, for example
```
from transformers import AutoModelForTokenClassification, BitsAndBytesConfig
from optim…
```
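For reference, a minimal self-contained sketch of creating such a model (the checkpoint name and the specific 4-bit settings are illustrative assumptions, not taken from the truncated snippet above):

```
import torch
from transformers import AutoModelForTokenClassification, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmuls
)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",                      # placeholder checkpoint
    quantization_config=bnb_config,
)
```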
-
### Describe the issue
After quantization, the output ONNX model has faster inference and a smaller model size, but why are the input and output tensors still float32?
I thought it should be u…
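For context, a small sketch (the file names and the use of `quantize_dynamic` are assumptions, since the original export setup is not shown) that checks the I/O dtypes of a quantized graph; ONNX Runtime quantization keeps the graph's float32 inputs and outputs by default and inserts the integer ops inside the graph:

```
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights to int8 (file names are placeholders).
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)

# The graph inputs/outputs are still declared as FLOAT; the int8 math happens
# internally via the inserted quantize/dequantize nodes.
quantized = onnx.load("model.int8.onnx")
for tensor in list(quantized.graph.input) + list(quantized.graph.output):
    print(tensor.name, onnx.TensorProto.DataType.Name(tensor.type.tensor_type.elem_type))
```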
-
Model: Llama-2-7b-chat
CT2 version: 3.19.0
I found that when I use int8* quantization, the inference speed depends drastically on num_hypotheses.
I tried to benchmark my model with a batch …
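For context, a rough shape of such a benchmark (the converted model path, prompt, batch size, and decoding settings below are illustrative assumptions, not the exact setup from this report):

```
import time
import ctranslate2
import transformers

generator = ctranslate2.Generator("llama-2-7b-chat-ct2", device="cuda", compute_type="int8")
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, how are you?"))

for num_hypotheses in (1, 2, 4, 8):
    start = time.perf_counter()
    generator.generate_batch(
        [prompt_tokens] * 8,          # batch of identical prompts
        beam_size=num_hypotheses,     # num_hypotheses must not exceed beam_size
        num_hypotheses=num_hypotheses,
        max_length=128,
    )
    print(f"num_hypotheses={num_hypotheses}: {time.perf_counter() - start:.2f}s")
```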
-
## Description
Hi, I notice from [Issue](https://github.com/NVIDIA/TensorRT/issues/3243#issuecomment-1714849183) that the int8 MHA_v2 kernel only supports SeqLen > 512. I use pytorch_quantization t…
-
I am trying dynamic quantization for the Hugging Face T5-small model on Graviton3. I have used
```
torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
```
In…
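For reference, a runnable sketch of that call (loading "t5-small" with `T5ForConditionalGeneration` is an assumption about the setup; only the `quantize_dynamic` line is from the report above):

```
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Dynamically quantize every nn.Linear to int8 weights; activations remain
# float32 and are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8
)
```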
-
### Description
With int8 & int4, and any further quantization schemes we will provide, it is possible that, to achieve adequate recall, some oversampling & rescoring with the raw float32 vectors might…
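As an illustration of the idea, a small sketch (the scoring scheme, scalar scale, and parameters are assumptions for illustration, not the project's actual implementation) of oversampling candidates with the quantized vectors and then rescoring them with the raw float32 vectors:

```
import numpy as np

def search_with_rescoring(query, int8_vectors, scale, float32_vectors, k=10, oversample=4):
    # First pass: cheap approximate scores against the int8-quantized vectors.
    approx_scores = (int8_vectors.astype(np.float32) * scale) @ query
    candidates = np.argsort(-approx_scores)[: k * oversample]
    # Second pass: rescore only the oversampled candidates with the raw float32
    # vectors, then keep the exact top-k.
    exact_scores = float32_vectors[candidates] @ query
    return candidates[np.argsort(-exact_scores)[:k]]
```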
-
Using FBGEMM to support CPU quantization.
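A minimal sketch of what that looks like in PyTorch (the toy model and the dynamic-quantization call are illustrative assumptions):

```
import torch

# FBGEMM is PyTorch's quantized backend for x86 server CPUs
# ("qnnpack" is the ARM alternative).
torch.backends.quantized.engine = "fbgemm"

model = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()  # toy placeholder model
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```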