Open RuRo opened 1 month ago
If you use the native TensorRT API to build the network, you can refer to `trtexec --best --onnx=fp32.onnx --dumpLayerInfo --exportLayerInfo=layer.log`.
From `layer.log` we can get some information about the layers.
You can also follow this sample: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq
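To make a file like `layer.log` easier to interpret, one option is to tally how many layer outputs use each datatype. The snippet below is only a sketch: it assumes the exported file is JSON with a top-level `Layers` list whose entries carry an `Outputs` list with a `Format/Datatype` string, which may differ between TensorRT versions and profiling verbosity settings. The function names and the sample data are hypothetical and hand-written, not taken from the attached logs.

```python
import json
from collections import Counter


def tally_output_datatypes(layer_info):
    """Count layer output tensors per datatype string in a parsed layer-info dict."""
    counts = Counter()
    for layer in layer_info.get("Layers", []):
        # Assumption: with low profiling verbosity, entries may be plain
        # layer-name strings without tensor details, so skip those.
        if not isinstance(layer, dict):
            continue
        for tensor in layer.get("Outputs", []):
            counts[tensor.get("Format/Datatype", "unknown")] += 1
    return counts


def int8_fraction(counts):
    """Fraction of tallied outputs whose datatype string mentions int8."""
    total = sum(counts.values())
    int8 = sum(n for fmt, n in counts.items() if "int8" in fmt.lower())
    return int8 / total if total else 0.0


# Hypothetical sample that mimics the assumed structure of the exported file.
sample = json.loads("""
{
  "Layers": [
    {"Name": "patch_embed", "Outputs": [{"Name": "t0", "Format/Datatype": "Row major Int8 format"}]},
    {"Name": "layer_norm", "Outputs": [{"Name": "t1", "Format/Datatype": "Row major linear FP32"}]},
    {"Name": "softmax", "Outputs": [{"Name": "t2", "Format/Datatype": "Row major linear FP32"}]}
  ]
}
""")

counts = tally_output_datatypes(sample)
for fmt, n in counts.most_common():
    print(f"{n:4d}  {fmt}")
print(f"INT8 share of outputs: {int8_fraction(counts):.0%}")
```

For a real file, replace the sample with `json.load(open("layer.log"))`. If the layer entries turn out to be plain strings, rebuilding the engine with `--profilingVerbosity=detailed` should add the per-tensor details, assuming the installed `trtexec` behaves that way.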
Hi, sorry for the delayed answer. Here are the layer logs for the two models (with `--best` and with `--int8`), although the contents of those files don't look particularly useful to me.
vision_vit_best.log vision_vit_int8.log
timm_vit_best.log timm_vit_int8.log
Also, I am using the simplest stock ViT models (see the reproduction script in the original post), so you should theoretically be able to reproduce my results and get any extra debugging information you need.
Regarding TensorRT-Model-Optimizer, I'll try it, but the current situation is honestly quite annoying. There are too many supposedly "official" (or at least endorsed) ways to do the same thing and most of them either don't work at all, or produce suboptimal results (and they often don't give easily interpretable outputs that could be used to verify that they are doing the right thing).
Here's a non-exhaustive list of supposedly "official" (endorsed by either PyTorch or TensorRT) quantization methods that support post-training quantization of PyTorch models for TensorRT inference:
I think that TensorRT and PyTorch could benefit from concentrating their efforts on a single project instead of duplicating the development efforts.
I think you can refer to https://github.com/NVIDIA/TensorRT/tree/release/10.2/demo/BERT
Description
I am trying to figure out if TensorRT and the `pytorch_quantization` module support post-training quantization for vision transformers.

The following piece of code follows the `pytorch_quantization` docs almost verbatim (with small changes for compatibility):

After that, I visualize the resulting engine graph with `trex`:

The conversion succeeds; however, the graph barely uses any INT8 operations. I would have expected almost the whole graph to consist of `Int8` operators, but instead most edges in the graph are labeled as `Float` with only a few `Int8`s.

Is this expected? My understanding was that most operators in transformers were supposed to be quantizable (with the notable exception of `LayerNorm` and `Softmax`, which would require special custom layers for quantization).

Relevant Files
vit_tiny_patch16_224 (timm)
![timm_vit onnx engine graph json](https://github.com/NVIDIA/TensorRT/assets/3747318/6a38d8fa-45e8-46b4-8098-21a1202bca2d)
vit_b_16 (torchvision)
![vision_vit onnx engine graph json](https://github.com/NVIDIA/TensorRT/assets/3747318/c8ffb3cf-8c30-491a-9c43-60d9e9c37dd5)

Environment
TensorRT Version: 10.0.0.6
NVIDIA GPU: NVIDIA RTX A6000
NVIDIA Driver Version: 535.171.04
CUDA Version: 12.2
CUDNN Version: 8
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.10.12
PyTorch Version (if applicable): 2.3.0
Baremetal or Container (if so, version): `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` docker container