lix19937 opened 1 week ago
loop @brb-nv
@lix19937 , TRT only has quantized INT8; there is no vanilla INT8 today. And you are right: only quantized INT8 is supported in plugins. bool/uint8/vanilla int8 are not supported in plugins today.
For your case, besides calling setDynamicRange(-128, 127), you can also use Q/DQ in your network, using the pattern below:
Q (scale 1) -> plugin(int8 input)
No calibration is needed for either workaround.
@lix19937 , since you have the pattern:
input -> plugin
and the plugin implementation is a black box to TRT, we could also work around this by packing your INT8/UINT8 input as kINT32, feeding TRT a kINT32 input, and inside your plugin implementation reading it back as INT8/UINT8.
Thanks very much @ttyio
use Q/DQ in your network
This needs an extra Mul layer before the plugin.
pack your INT8/UINT8 input as kINT32, feed TRT as kINT32 input, and inside your plugin implementation, you can read it as INT8/UINT8.
Just casting const void* const* inputs from void* to int8_t* in enqueue may be more efficient.
This should work! It is similar in the BERT plugin: the mask input in the fused MHA is actually not of type kINT32, but we use kINT32 because the unfused version has a kINT32 mask.
https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/embLayerNormPlugin/embLayerNormPlugin.cpp#L408 https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/bertQKVToContextPlugin/qkvToContext.cu#L706
Description
If one of the model inputs is int8 (like an index, not produced by calibration) and the other inputs are float or int32, when I use
error from trtexec
Also, I found that as long as there is an int8 type among the inputs, calibration will be triggered, and the user needs to override the scale with a scale and zero point. But in real cases, some inputs are originally int8 and do not require quantization.
Here I use the int8 type as the index data type, just for vectorized access, to improve bandwidth utilization and reduce the number of threads created. So int8/uint8/bool inputs are not supported in a TRT plugin? If not, I think that is not very reasonable.
From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_plugin_v2.html#af502120d140358f8d8139ade5005b9f5
@ttyio @zerollzeng
Environment
TensorRT Version: 8611
NVIDIA GPU: Orin-X
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version: 11.6
Operating System: Ubuntu 20.04
PyTorch Version (if applicable): 1.13