-
After compressing the BERT MRPC model, I validated performance with paddle_inference_eval and found a large accuracy gap between int8 and fp32:
--precision=fp32 84
--precision=fp16 84
--precision=int8 61
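For context, INT8 in Paddle Inference is usually enabled through the TensorRT sub-graph engine. Below is a minimal sketch of the predictor setup with placeholder model paths and flags similar to what a script like paddle_inference_eval would pass; it is an illustration, not that script's exact code:

```python
import paddle.inference as paddle_infer

# Placeholder paths for the compressed BERT-MRPC model.
config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(256, 0)  # 256 MB initial GPU memory pool, device 0

# Switch PrecisionType.Int8 <-> Half/Float32 to reproduce the accuracy gap above.
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=1,
    min_subgraph_size=5,
    precision_mode=paddle_infer.PrecisionType.Int8,
    use_static=False,
    use_calib_mode=False,  # offline-quantized models usually skip online calibration
)

predictor = paddle_infer.create_predictor(config)
```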
oydf updated 9 months ago
-
I tested it with batch_num=1, seq_len=128, head_num=5, head_dim=64. It shows "FMHA Inference took 75.82559204 ms, 17.97742325 GFlop/s, 0.01728598 GB/s INT8 average absolute deviation: 1.552685 %". B…
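For reference, a metric like the reported "INT8 average absolute deviation" is commonly the mean absolute difference between the INT8 and FP32 outputs, normalized by the mean FP32 magnitude. The exact formula used by this benchmark is an assumption; a minimal NumPy sketch:

```python
import numpy as np

def avg_abs_deviation_pct(int8_out: np.ndarray, fp32_out: np.ndarray) -> float:
    """Mean |INT8 - FP32| normalized by mean |FP32|, in percent.
    (Assumed definition; the benchmark's exact formula may differ.)"""
    diff = np.abs(int8_out.astype(np.float32) - fp32_out)
    return 100.0 * diff.mean() / (np.abs(fp32_out).mean() + 1e-12)

# Shapes from the report: batch=1, heads=5, seq_len=128, head_dim=64
fp32 = np.random.randn(1, 5, 128, 64).astype(np.float32)
int8_dequant = fp32 + 0.01 * np.random.randn(*fp32.shape).astype(np.float32)
print(f"average absolute deviation: {avg_abs_deviation_pct(int8_dequant, fp32):.6f} %")
```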
-
## Description
I recently attempted to utilize INT8 quantization with Stable Diffusion XL to enhance inference performance based on the claims made in a recent [TensorRT blog post](https://developer.…
teith updated 6 months ago
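For reference, the workflow described in that blog post is commonly reproduced with NVIDIA's TensorRT Model Optimizer (modelopt). The sketch below is a hedged outline of post-training INT8 quantization of the SDXL UNet; the calibration prompts and config choice are assumptions, not the blog's exact recipe:

```python
import torch
from diffusers import StableDiffusionXLPipeline
import modelopt.torch.quantization as mtq

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

calib_prompts = ["a photo of a cat", "a watercolor landscape"]  # tiny placeholder set

def forward_loop(unet):
    # Run a few denoising passes so activation ranges can be calibrated.
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=4)

# Generic INT8 config; modelopt also ships SmoothQuant-style INT8 configs.
mtq.quantize(pipe.unet, mtq.INT8_DEFAULT_CFG, forward_loop)
```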
-
I observed that nv_full INT8 inference on the VP takes more time than FP16 inference.
NVDLA HW branch: nvdlav1, config: nv_full
NVDLA SW branch: Latest with INT8 option in nvdla_compiler
Ple…
-
I have followed the instructions provided by @fsx950223 to create an int8 quantized tflite model. The quantization covered weights and layer outputs. The tflite obtained from an efficientdet-d2 checkpoin…
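For reference, full-integer (weights and activations) TFLite quantization generally follows the pattern below. This is a minimal sketch assuming an EfficientDet-D2 SavedModel export and a placeholder representative dataset:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder: yield a handful of preprocessed input images (768x768 for D2).
    for _ in range(100):
        yield [np.random.rand(1, 768, 768, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("efficientdet-d2_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force integer-only ops so both weights and activations are quantized to int8.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # or tf.uint8, depending on the pipeline
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("efficientdet-d2_int8.tflite", "wb") as f:
    f.write(tflite_model)
```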
-
### Experiment plan
- Which tuning method is the most memory-efficient?
- Which quantization method gives the highest accuracy at inference time?
#### Memory-usage comparison groups during finetuning
1. Full finetuning
2. LoRA tuning
3. llm.int8() + L… (see the loading sketch after this list)
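As a reference point for option 3, loading a model with llm.int8() via Hugging Face Transformers and bitsandbytes typically looks like the sketch below; the model name and settings are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# llm.int8(): 8-bit weights with a higher-precision path for outlier features.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

print(model.get_memory_footprint() / 1e9, "GB")  # rough memory comparison hook
```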
-
Hi, thanks for your wonderful work. However, I got very different results from cpu and edgetpu. In the following image, the left one is the result using cpu, and the right one is the result using edg…
-
Hi, guys. I noticed that BigDL utilizes BigDL Nano and ggml to accelerate int8/int4 computations. I wonder how to invoke these APIs in LLMs like LLAMA. Specifically, I want to accelerate the linear lay…
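For reference, BigDL-LLM exposes this through drop-in Transformers-style wrappers that quantize the linear layers at load time. A minimal sketch, assuming the bigdl-llm package and a placeholder LLaMA checkpoint (module paths are worth double-checking against the BigDL-LLM docs):

```python
# Low-bit (int4/int8) loading with BigDL-LLM's Transformers-style API.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/llama-checkpoint"  # placeholder

# load_in_4bit=True applies low-bit quantization to the linear layers;
# load_in_low_bit="sym_int8" is the int8 variant.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```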
-
First, thanks for this high-quality project.
I converted my model with torch2trt as follows:
...
model_trt_float32 = torch2trt(my_model, [ims], max_batch_size=32)
model_trt…
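For comparison, torch2trt's INT8 path is usually enabled through its calibration arguments. The sketch below reuses the `my_model` and `ims` names from the snippet above and is a hedged outline rather than the issue's exact code:

```python
from torch2trt import torch2trt

# INT8 build; by default torch2trt calibrates on the example inputs,
# so pass a larger int8_calib_dataset for meaningful activation ranges
# (see torch2trt's docs for the expected dataset format).
model_trt_int8 = torch2trt(
    my_model,
    [ims],
    max_batch_size=32,
    int8_mode=True,
    # int8_calib_dataset=my_calib_dataset,
    # int8_calib_batch_size=32,
)
```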
-
## Description
I generated a calibration cache for a Vision Transformer ONNX model using the EntropyCalibration2 method. When trying to generate an engine file from the cache file for INT8 precision using trte…
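For reference, building an INT8 engine from an existing calibration cache with the TensorRT Python API generally looks like the sketch below; file names are placeholders, and trtexec's --int8/--calib flags are the CLI equivalent:

```python
import tensorrt as trt

CACHE_FILE = "vit_calibration.cache"   # placeholder paths
ONNX_FILE = "vit.onnx"

class CacheOnlyCalibrator(trt.IInt8EntropyCalibrator2):
    """Calibrator that only replays an existing cache (no new calibration data)."""
    def get_batch_size(self):
        return 1
    def get_batch(self, names):
        return None  # no batches: activation ranges come from the cache
    def read_calibration_cache(self):
        with open(CACHE_FILE, "rb") as f:
            return f.read()
    def write_calibration_cache(self, cache):
        pass

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# TensorRT 8.x-style explicit-batch network creation.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open(ONNX_FILE, "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = CacheOnlyCalibrator()

engine_bytes = builder.build_serialized_network(network, config)
with open("vit_int8.engine", "wb") as f:
    f.write(engine_bytes)
```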