NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

INT8 quantization does not work on BERT-like embedding model #4058

Closed: renne444 closed this issue 3 months ago

renne444 commented 3 months ago

Description

Hello, I am performing INT8 quantization on a BERT-like embedding model. I noticed that after quantization, inference is much slower than FP16, and the output of the TRT engine is essentially identical to the FP32 results. I suspect the model has not actually been quantized to INT8.

I have tried both IInt8MinMaxCalibrator and IInt8EntropyCalibrator2, but neither worked. I also ran trtexec --onnx="xx" --int8 --minShapes=input_ids:1x1,attention_mask:1x1 --optShapes=input_ids:16x128,attention_mask:16x128 --maxShapes=input_ids:128x512,attention_mask:128x512 directly and got similar results, with inference speed and outputs matching FP32. Do you have any idea?
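For context, a minimal sketch of the kind of IInt8MinMaxCalibrator setup described above (a sketch only: the batch size, input names, and the pycuda-based memory handling are assumptions, not the reporter's actual builder.py):

import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class MinMaxCalibrator(trt.IInt8MinMaxCalibrator):
    """Feeds pre-tokenized (input_ids, attention_mask) batches to TensorRT."""

    def __init__(self, batches, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)   # iterable of dicts: input name -> np.int64 array
        self.cache_file = cache_file
        self.buffers = {}              # lazily allocated device buffers

    def get_batch_size(self):
        return 16                      # must match the calibration profile's batch dim

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                # tells TensorRT that calibration is finished
        ptrs = []
        for name in names:             # e.g. ["input_ids", "attention_mask"]
            arr = np.ascontiguousarray(batch[name])
            if name not in self.buffers:
                self.buffers[name] = cuda.mem_alloc(arr.nbytes)
            cuda.memcpy_htod(self.buffers[name], arr)
            ptrs.append(int(self.buffers[name]))
        return ptrs

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

Note that even with a correct calibrator, the DeprecationWarnings in the build log below point out that calibration-based (implicit) quantization is deprecated in TensorRT 10.1 and superseded by explicit quantization.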

Environment

TensorRT Version:

NVIDIA GPU: A100

NVIDIA Driver Version: 525.105.17

CUDA Version: 12.5

Docker Image: nvcr.io/nvidia/tensorrt:24.06-py3

Embedding Model Structure

XLMRobertaModel(
  (embeddings): XLMRobertaEmbeddings(
    (word_embeddings): Embedding(250002, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): XLMRobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x XLMRobertaLayer(
        (attention): XLMRobertaAttention(
          (self): XLMRobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): XLMRobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): XLMRobertaIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): XLMRobertaOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): XLMRobertaPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

Log While Building Engine

Using the MinMax calibrator

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[08/07/2024-07:04:31] [TRT] [I] [MemUsageChange] Init CUDA: CPU +327, GPU +0, now: CPU 356, GPU 13489 (MiB)
[08/07/2024-07:04:33] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1619, GPU +352, now: CPU 2122, GPU 13841 (MiB)
[08/07/2024-07:04:33] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1115194509
[08/07/2024-07:04:34] [TRT] [W] ModelImporter.cpp:420: Make sure input input_ids has Int64 binding.
[08/07/2024-07:04:34] [TRT] [W] ModelImporter.cpp:420: Make sure input attention_mask has Int64 binding.
/workspace/triton/builder.py:137: DeprecationWarning: Use Deprecated in TensorRT 10.1. Superseded by explicit quantization. instead.
  config.int8_calibrator = calibrator
/workspace/triton/builder.py:146: DeprecationWarning: Use Deprecated in TensorRT 10.1. Superseded by explicit quantization. instead.
  config.set_calibration_profile(optimize_profile)
[08/07/2024-07:04:35] [TRT] [I] Calibration table does not match calibrator algorithm type.
[08/07/2024-07:04:35] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/07/2024-07:04:39] [TRT] [I] Detected 2 inputs and 2 output network tensors.
[08/07/2024-07:04:40] [TRT] [I] Total Host Persistent Memory: 773792
[08/07/2024-07:04:40] [TRT] [I] Total Device Persistent Memory: 0
[08/07/2024-07:04:40] [TRT] [I] Total Scratch Memory: 6299648
[08/07/2024-07:04:40] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 638 steps to complete.
[08/07/2024-07:04:40] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 36.3776ms to assign 14 blocks to 638 nodes requiring 93607424 bytes.
[08/07/2024-07:04:40] [TRT] [I] Total Activation Memory: 93607424
[08/07/2024-07:04:40] [TRT] [I] Total Weights Memory: 1112178176
[08/07/2024-07:04:40] [TRT] [I] Engine generation completed in 5.11607 seconds.
[08/07/2024-07:04:40] [TRT] [I] [MS] Running engine with multi stream info
[08/07/2024-07:04:40] [TRT] [I] [MS] Number of aux streams is 2
[08/07/2024-07:04:40] [TRT] [I] [MS] Number of total worker streams is 3
[08/07/2024-07:04:40] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[08/07/2024-07:04:40] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +89, now: CPU 1, GPU 1150 (MiB)
[08/07/2024-07:04:40] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[08/07/2024-07:04:40] [TRT] [I] Starting Calibration.
[08/07/2024-07:04:41] [TRT] [I]   Calibrated batch 0 in 0.222241 seconds.
[08/07/2024-07:04:41] [TRT] [I]   Calibrated batch 1 in 0.194591 seconds.
[08/07/2024-07:04:41] [TRT] [I]   Calibrated batch 2 in 0.187114 seconds.
[08/07/2024-07:04:41] [TRT] [I]   Calibrated batch 3 in 0.185927 seconds.
[08/07/2024-07:04:41] [TRT] [I]   Calibrated batch 4 in 0.185639 seconds.
[08/07/2024-07:04:41] [TRT] [I]   Post Processing Calibration data in 0.00342259 seconds.
[08/07/2024-07:04:41] [TRT] [I] Calibration completed in 6.2236 seconds.
[08/07/2024-07:04:41] [TRT] [W] Missing scale and zero-point for tensor input_ids, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/07/2024-07:04:41] [TRT] [W] Missing scale and zero-point for tensor attention_mask, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[08/07/2024-07:04:41] [TRT] [W] Missing scale and zero-point for tensor onnx::Slice_209_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[... ~130 similar "Missing scale and zero-point" warnings elided: they cover the LayerNorm weight/bias outputs of all 12 encoder layers, the word/position/token-type embedding tables, and the ONNXTRT_Broadcast/castHelper/shape-helper tensors produced during ONNX import ...]
[08/07/2024-07:04:42] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/07/2024-07:04:46] [TRT] [I] Detected 2 inputs and 2 output network tensors.
[08/07/2024-07:04:46] [TRT] [I] Total Host Persistent Memory: 32
[08/07/2024-07:04:46] [TRT] [I] Total Device Persistent Memory: 0
[08/07/2024-07:04:46] [TRT] [I] Total Scratch Memory: 704774656
[08/07/2024-07:04:46] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[08/07/2024-07:04:46] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.007844ms to assign 2 blocks to 2 nodes requiring 704775168 bytes.
[08/07/2024-07:04:46] [TRT] [I] Total Activation Memory: 704775168
[08/07/2024-07:04:46] [TRT] [I] Total Weights Memory: 1112180672
[08/07/2024-07:04:46] [TRT] [I] Engine generation completed in 4.77181 seconds.
[08/07/2024-07:04:46] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 1757 MiB
[08/07/2024-07:04:47] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 5145 MiB
nvpohanh commented 3 months ago

BERT-like models do not support calibration. Please use TRT ModelOpt to insert Q/DQ ops into the ONNX model: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq
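For reference, a minimal sketch of that workflow (the flag names follow the onnx_ptq README linked above and may differ by ModelOpt version; the model and data file names are placeholders):

# 1) Insert Q/DQ nodes into the ONNX model (explicit quantization).
python -m modelopt.onnx.quantization \
    --onnx_path=model.onnx \
    --quantize_mode=int8 \
    --calibration_data=calib.npz \
    --output_path=model.quant.onnx

# 2) Build the engine from the explicitly quantized model; no calibrator is needed.
trtexec --onnx=model.quant.onnx --int8 --fp16 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:16x128,attention_mask:16x128 \
    --maxShapes=input_ids:128x512,attention_mask:128x512

With Q/DQ ops embedded in the graph, the --int8 build uses the stored scales instead of running calibration, and --fp16 lets the layers that remain unquantized run in half precision rather than FP32.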

renne444 commented 3 months ago

Thank you for the quick response. I have a follow-up question for my own understanding: could you explain in more detail why BERT-like models do not support calibration? Is this related to the architecture or other characteristics of these models?

lix19937 commented 3 months ago

Usually, TRT PTQ auto-inserts Q/DQ nodes (implicit quantization) and gets the best performance, especially for CNNs. For LLM/GPT-like models, however, the graph either drops into a Myelin ForeignNode (poor performance, running at FP16) or hits layers with no INT8 support.

Note that TRT OSS provides a demo for deploying BERT-like models, which uses custom plugins to replace ops such as MHA and LayerNorm, and even supports ASP-QAT. @nvpohanh https://github.com/NVIDIA/TensorRT/tree/release/10.2/demo/BERT
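One way to check how much of the graph actually runs in INT8, rather than falling into a Myelin ForeignNode, is trtexec's layer-info dump (a sketch; the model path is a placeholder):

trtexec --onnx=model.quant.onnx --int8 --fp16 \
    --profilingVerbosity=detailed \
    --dumpLayerInfo --exportLayerInfo=layers.json

With detailed profiling verbosity the exported layer info includes per-layer precision and format details, so inspecting layers.json shows which layers were assigned INT8 kernels and which fell back.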

renne444 commented 3 months ago

Thanks for your reply.