
NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

https://developer.nvidia.com/tensorrt

Apache License 2.0


How to choose which layers to quantize for faster performance? #3763

Open luoshiyong opened 5 months ago

luoshiyong commented 5 months ago

During YOLOv8 INT8 quantization, I found that some layers running in INT8 are slower than in FP16, and the reformat operations are very time-consuming. For the best precision, we can run a sensitive-layer analysis to pick the proper layers to quantize, but for the best speed, how should I identify which layers to quantize? (Some screenshots below.)
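One way to put a number on the reformat overhead described above is to profile each engine per layer (for example with trtexec's `--dumpProfile --exportProfile=profile.json` options) and sum the time spent in reformatting layers. The sketch below is a minimal, stdlib-only helper built on two assumptions you should check against the file your TensorRT version produces: the exported JSON is a list whose first element is a `{"count": N}` record and whose other elements carry `name` and `averageMs`, and reformat layers have names containing "Reformatting CopyNode".

```python
import json
from typing import Iterable


def load_profile(path: str) -> list:
    """Load a per-layer profile exported by trtexec (--exportProfile)."""
    with open(path) as f:
        return json.load(f)


def layer_times(profile: Iterable[dict]) -> list[tuple[str, float]]:
    """(name, averageMs) for every layer entry.

    Entries without a "name" (assumed: the leading {"count": N} record) are skipped.
    """
    return [(e["name"], float(e["averageMs"])) for e in profile if "name" in e]


def reformat_overhead(profile, marker: str = "Reformatting CopyNode") -> tuple[float, float]:
    """Return (time spent in reformat layers, total time), both in ms.

    The marker string is an assumption about how reformat layers are named.
    """
    times = layer_times(profile)
    total = sum(ms for _, ms in times)
    reformat = sum(ms for name, ms in times if marker in name)
    return reformat, total


def slowest_layers(profile, k: int = 10) -> list[tuple[str, float]]:
    """The k layers with the largest average time, slowest first."""
    return sorted(layer_times(profile), key=lambda t: t[1], reverse=True)[:k]


if __name__ == "__main__":
    # Synthetic data shaped like a trtexec export; the numbers are made up.
    sample = [
        {"count": 100},
        {"name": "Reformatting CopyNode for Input Tensor 0 to /model.22/cv2/Conv", "averageMs": 0.05},
        {"name": "/model.0/conv/Conv", "averageMs": 0.20},
        {"name": "/model.22/cv2/Conv", "averageMs": 0.25},
    ]
    ref, total = reformat_overhead(sample)
    print(f"reformat share: {100 * ref / total:.1f}%")
    for name, ms in slowest_layers(sample, k=2):
        print(f"{ms:.3f} ms  {name}")
```

Running it once on the FP16 engine's profile and once on the INT8 engine's profile shows whether the INT8 build loses its gains to reformats. Layer names usually differ between the two builds because of fusion, so comparing totals and the reformat share is more reliable than expecting a one-to-one layer mapping.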

lix19937 commented 5 months ago

Refer to the SVG graph of the PTQ engine.

luoshiyong commented 5 months ago


The graphs above are already taken from the SVG. My question is how to efficiently find which layers run faster when quantized, without spending time going through the graphs by hand.
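Instead of reading the graphs manually, the search can be automated: start from an FP16 baseline, enable INT8 for one candidate layer (or block) at a time, rebuild and time the engine, and keep the change only if latency drops. The sketch below is a generic greedy search, not a TensorRT feature. `measure_latency` is a hypothetical callback you would write yourself (for example: insert Q/DQ nodes only for the given layers, build the engine, and time it with trtexec). Since every call means an engine build, grouping layers into blocks (e.g. per YOLOv8 module) keeps the number of builds manageable.

```python
from typing import Callable, Iterable


def greedy_int8_selection(
    candidates: Iterable[str],
    measure_latency: Callable[[frozenset], float],
    min_gain_ms: float = 0.0,
) -> tuple[frozenset, float]:
    """Greedily add layers to the INT8 set while each addition lowers latency.

    measure_latency(int8_layers) is a hypothetical, user-supplied callback that
    builds and times an engine with exactly those layers in INT8 and returns ms.
    """
    selected: frozenset = frozenset()
    best = measure_latency(selected)  # FP16 baseline: nothing quantized
    for layer in candidates:
        trial = selected | {layer}
        latency = measure_latency(trial)
        # Keep the layer only if it beats the current best by more than the noise margin.
        if best - latency > min_gain_ms:
            selected, best = trial, latency
    return selected, best


if __name__ == "__main__":
    # Toy cost model with made-up numbers: quantizing the head adds a reformat.
    delta_ms = {"backbone.conv1": -0.3, "backbone.conv2": -0.3, "head.detect": +0.2}

    def toy_latency(int8_layers: frozenset) -> float:
        return 5.0 + sum(delta_ms[n] for n in int8_layers)

    chosen, ms = greedy_int8_selection(delta_ms, toy_latency)
    print(sorted(chosen), round(ms, 2))  # ['backbone.conv1', 'backbone.conv2'] 4.4
```

The result depends on the order of `candidates` and on interactions between neighbouring layers (reformats only appear at precision boundaries), so this finds a reasonable local choice rather than a guaranteed optimum; a non-zero `min_gain_ms` helps ignore run-to-run timing noise.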

lix19937 commented 5 months ago

Refer to the Q/DQ placement recommendations: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs @luoshiyong
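One of the recommendations in that section is to quantize all inputs of weighted operations (Convolution, Transposed Convolution and GEMM). In an explicitly quantized (Q/DQ) network, a weighted op missing Q/DQ on an input stays in higher precision, which can also introduce reformat layers around it. The sketch below is a heuristic check, not TensorRT's own logic: it assumes nodes expose `name`, `op_type`, `input` and `output` like ONNX `NodeProto`, that quantized inputs come from `DequantizeLinear` nodes, and that ONNX `MatMul` counts as a GEMM. The `Node` class is a hypothetical stand-in so the check runs without the onnx package.

```python
from dataclasses import dataclass, field

# ONNX op types treated as weighted ops (MatMul -> GEMM is an assumption).
WEIGHTED_OPS = {"Conv", "ConvTranspose", "Gemm", "MatMul"}


@dataclass
class Node:
    """Minimal stand-in for onnx.NodeProto, using the same field names."""
    name: str
    op_type: str
    input: list[str] = field(default_factory=list)
    output: list[str] = field(default_factory=list)


def unquantized_weighted_ops(nodes) -> list[str]:
    """Names of weighted ops whose activation/weight inputs are not all
    produced by DequantizeLinear nodes."""
    nodes = list(nodes)
    dq_outputs = {out for n in nodes if n.op_type == "DequantizeLinear" for out in n.output}
    flagged = []
    for n in nodes:
        if n.op_type not in WEIGHTED_OPS:
            continue
        # Check only the first two inputs (activation, weight); a third is the bias.
        if not all(name in dq_outputs for name in n.input[:2]):
            flagged.append(n.name)
    return flagged


if __name__ == "__main__":
    graph = [
        Node("dq_x", "DequantizeLinear", ["x_q", "s", "z"], ["x_dq"]),
        Node("dq_w", "DequantizeLinear", ["w_q", "s", "z"], ["w_dq"]),
        Node("conv_ok", "Conv", ["x_dq", "w_dq", "b"], ["y1"]),
        Node("conv_fp", "Conv", ["y1", "w_dq"], ["y2"]),  # activation not quantized
    ]
    print(unquantized_weighted_ops(graph))  # ['conv_fp']
```

With the onnx package installed, the same function can be pointed at a real exported model, e.g. `unquantized_weighted_ops(onnx.load("model_qdq.onnx").graph.node)` (the file name is hypothetical), to list the convolutions that will not run in INT8.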