NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

int8/uint8/bool as input not supported in TRT plugin? #3959

Open lix19937 opened 1 week ago

lix19937 commented 1 week ago

Description

If one of the model inputs is int8 (e.g. an index, not produced by calibration) and the other inputs are float or int32, when I use

trtexec --onnx=./sca.onnx --plugins=./libplugin_custom.so --verbose --inputIOFormats=fp32:chw,fp32:chw,int8:chw \
--outputIOFormats=int32:chw,int32:chw,fp32:chw,fp32:chw,fp32:chw  

or  

trtexec --onnx=./sca.onnx --plugins=./libplugin_custom.so --verbose --inputIOFormats=fp16:chw,fp16:chw,int8:chw \
--outputIOFormats=int32:chw,int32:chw,fp16:chw,fp16:chw,fp16:chw  --fp16    

the error from trtexec is

[06/22/2024-21:01:54] [E] Error[9]: [pluginV2Builder.cpp::reportPluginError::23] Error Code 9: Internal Error (/SCA_IndexRebatch_TRT: could not find any supported formats consistent with input/output data types)
[06/22/2024-21:01:54] [E] Error[2]: [builder.cpp::buildSerializedNetwork::743] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

Also, I found that as long as there is an int8 input, calibration is triggered and the user has to override the scale and zero-point. But in real cases some inputs are natively int8 and do not require quantization.
Here I use int8 as the index data type purely for vectorized access, to improve bandwidth utilization and reduce the number of threads launched. So are int8/uint8/bool inputs not supported in TRT plugins? If they are not supported, I think that is not very reasonable.

From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_plugin_v2.html#af502120d140358f8d8139ade5005b9f5

Warning

for the format field, the values PluginFormat::kCHW4, PluginFormat::kCHW16, and PluginFormat::kCHW32 will not be passed in, this is to keep backward compatibility with TensorRT 5.x series. Use PluginV2IOExt or PluginV2DynamicExt for other PluginFormats. DataType:kBOOL and DataType::kUINT8 are not supported.

@ttyio @zerollzeng

Environment

TensorRT Version: 8611

NVIDIA GPU: Orin-X

NVIDIA Driver Version:

CUDA Version: 11.4

CUDNN Version: 11.6

Operating System: Ubuntu 20.04

PyTorch Version (if applicable): 1.13

lix19937 commented 1 week ago

Looping in @brb-nv

ttyio commented 1 week ago

@lix19937 , TRT only has quantized INT8, no vanilla INT8 today. And you are right, only quantized int8 is supported in plugins; bool/uint8/vanilla int8 are not supported in plugins today.

For your case, besides calling setDynamicRange(-128, 127), you can also use Q/DQ in your network with the pattern below:

      Q (scale 1) -> plugin(int8 input)

No calibration is needed for either workaround.
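
For reference, a minimal C++ sketch of the setDynamicRange workaround (the helper name, the surrounding builder/parser flow, and the assumption that every kINT8 input should get an identity range are mine, not from this thread):

```cpp
#include "NvInfer.h"

// Mark natively-int8 inputs with an identity dynamic range so that no
// calibrator is needed; the raw int8 index values then pass through unchanged.
void markInt8IndexInputs(nvinfer1::INetworkDefinition* network)
{
    for (int32_t i = 0; i < network->getNbInputs(); ++i)
    {
        nvinfer1::ITensor* input = network->getInput(i);
        if (input->getType() == nvinfer1::DataType::kINT8)
        {
            // [-128, 127] corresponds to scale 1.0 and zero-point 0.
            input->setDynamicRange(-128.0f, 127.0f);
        }
    }
}
// Note: the builder config still needs BuilderFlag::kINT8 enabled for the
// engine to accept int8 I/O.
```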

ttyio commented 1 week ago

@lix19937 , since you have the pattern:

   input -> plugin

and the plugin implementation is a black box to TRT, we could also work around this by packing your INT8/UINT8 input as kINT32, feeding it to TRT as a kINT32 input, and reading it as INT8/UINT8 inside your plugin implementation.
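
A sketch of what that could look like in an IPluginV2DynamicExt's supportsFormatCombination (the MyPlugin class, the position of the packed index input, and the fp32 formats for the other tensors are assumptions):

```cpp
#include "NvInfer.h"

// Hypothetical plugin deriving from nvinfer1::IPluginV2DynamicExt. The index
// input is declared to TensorRT as kINT32 (with its last dimension shrunk 4x
// on the model side), while the bytes it carries are really int8 indices.
bool MyPlugin::supportsFormatCombination(int32_t pos,
    nvinfer1::PluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs) noexcept
{
    nvinfer1::PluginTensorDesc const& desc = inOut[pos];
    if (pos == 2) // assumed position of the packed index input
    {
        return desc.type == nvinfer1::DataType::kINT32
            && desc.format == nvinfer1::TensorFormat::kLINEAR;
    }
    // All other inputs/outputs stay fp32 linear in this sketch.
    return desc.type == nvinfer1::DataType::kFLOAT
        && desc.format == nvinfer1::TensorFormat::kLINEAR;
}
```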

lix19937 commented 1 week ago

Thanks very much @ttyio

> use Q/DQ in your network

This requires adding one Mul layer before the plugin.

> pack your INT8/UINT8 input as kINT32, feed TRT as kINT32 input, and inside your plugin implementation, you can read it as INT8/UINT8.

Just casting the const void* const* inputs from void* to int8_t* in enqueue may be more efficient.
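
A sketch of that cast inside enqueue, under the same assumptions as above (MyPlugin, the input ordering, and the kernel launch are placeholders):

```cpp
#include <cuda_runtime.h>
#include "NvInfer.h"

// enqueue of the same hypothetical IPluginV2DynamicExt plugin: the engine binds
// inputs[2] as kINT32, but the buffer is simply reinterpreted as int8 indices,
// so no copy or conversion kernel is required.
int32_t MyPlugin::enqueue(nvinfer1::PluginTensorDesc const* inputDesc,
    nvinfer1::PluginTensorDesc const* outputDesc, void const* const* inputs,
    void* const* outputs, void* workspace, cudaStream_t stream) noexcept
{
    float const* features = static_cast<float const*>(inputs[0]);
    // Read int8 data through the kINT32 binding.
    int8_t const* indices = static_cast<int8_t const*>(inputs[2]);
    float* out = static_cast<float*>(outputs[0]);
    // launchIndexRebatchKernel(out, features, indices, stream); // plugin's actual CUDA kernel
    return 0;
}
```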

lix19937 commented 1 week ago

Similar case: https://github.com/NVIDIA/TensorRT/issues/1792#issuecomment-2187655265

ttyio commented 1 week ago

> Thanks very much @ttyio
>
> > use Q/DQ in your network
>
> This requires adding one Mul layer before the plugin.
>
> > pack your INT8/UINT8 input as kINT32, feed TRT as kINT32 input, and inside your plugin implementation, you can read it as INT8/UINT8.
>
> Just casting the const void* const* inputs from void* to int8_t* in enqueue may be more efficient.

This should work! Similarly, in the BERT plugin the mask input of the fused MHA is actually not of type kINT32, but we use kINT32 because the unfused version has a kINT32 mask.

https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/embLayerNormPlugin/embLayerNormPlugin.cpp#L408
https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/bertQKVToContextPlugin/qkvToContext.cu#L706