I feel like the error is in the Q/DQ placement, @ttyio what do you think?
@proevgenii Just want to double confirm: you make use of https://github.com/NVIDIA/TensorRT/blob/a167852705d74bcc619d8fad0af4b9e4d84472fc/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb for your own ViT model, am I correct?
Yes, that's right, I am using this tutorial to quantize my pre-trained ViT model from timm.
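Roughly, the setup looks like this (a minimal sketch, not the exact code I ran; the model name and num_classes=6 are illustrative, the latter taken from the (1, 6) output shape in the logs below):

```python
# Minimal sketch of the setup (assumptions: pytorch-quantization toolkit from the
# notebook, a timm ViT; the model name and num_classes here are illustrative).
import timm
from pytorch_quantization import quant_modules

# Swap torch.nn modules for their quantized counterparts before the model is built,
# so Conv/Linear layers carry TensorQuantizer (fake Q/DQ) modules.
quant_modules.initialize()

q_model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=6)
q_model = q_model.cuda().eval()
# ...calibration / QAT fine-tuning as in the notebook...
```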
@ttyio ^ ^
@proevgenii Do you have the TRT verbose build log? I want to check which operator failed. Is it the slice layer? Thanks!
@ttyio Yep, here are two log files: log.txt, log_outputIOFormats.txt
@proevgenii, I did not see Internal Error (Assertion start <= stop failed. ) in your log?
@ttyio Yeah, but here is a screenshot of how I got this log.txt, and the command in the cell also produced output which contains
Internal Error (Assertion start <= stop failed. )
What about --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8? Should I use it or not, and why does it give me a different error?
@ttyio Any updates?
@proevgenii Sorry for the delayed response, I cannot tell why from the log and screenshot. Could you also share the onnx file? Thanks!
--inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8
This tells TensorRT to use the int8 datatype and CHW layout for the network input and output, which saves some reformatting at the network boundary (you then need to interpret that int8 data outside of TRT yourself; this is useful when other parts of your workflow accept int8 as input/output). By doing this you may change the precision used for the layers near the network output tensors. I need to check your onnx to see the detailed reason.
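For example, handling int8 I/O outside of TRT would look roughly like this (a sketch; the scales are placeholders that in practice come from the Q/DQ nodes at the network boundary):

```python
# Rough sketch of quantizing/dequantizing int8 engine bindings outside TensorRT;
# input_scale/output_scale are placeholders taken from the boundary Q/DQ nodes.
import numpy as np

def to_int8_chw(x_fp32: np.ndarray, input_scale: float) -> np.ndarray:
    # Quantize an FP32 CHW tensor into the int8 values the engine expects at its input.
    q = np.round(x_fp32 / input_scale)
    return np.clip(q, -128, 127).astype(np.int8)

def from_int8(y_int8: np.ndarray, output_scale: float) -> np.ndarray:
    # Dequantize the engine's int8 output back to FP32.
    return y_int8.astype(np.float32) * output_scale
```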
@ttyio Here's the onnx model
@proevgenii, could you try a workaround (WAR) by folding constants first? Thanks!
polygraphy surgeon sanitize q_model.onnx --fold-constants --output q_model_folded.onnx
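After folding, you can sanity-check that the folded model still runs with onnxruntime before feeding it to trtexec (a quick sketch, assuming the input tensor is named "input" as in your export code):

```python
# Quick sanity check of the folded model (assumes onnxruntime is installed and the
# ONNX input tensor is named "input").
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("q_model_folded.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(out[0].shape)  # expect (1, 6) for this model
```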
@ttyio It looks like it simplifies the onnx model graph, similar to onnx_simplifier, which I also tried.
But I ran this:
polygraphy surgeon sanitize q_model.onnx --fold-constants --output q_model_folded.onnx
Then I ran trtexec, which failed with this error:
[08/23/2023-10:32:14] [E] Error[10]: Could not find any implementation for node patch_embed.proj.weight + /patch_embed/proj/_weight_quantizer/QuantizeLinear + /patch_embed/proj/Conv.
[08/23/2023-10:32:14] [E] Error[10]: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node patch_embed.proj.weight + /patch_embed/proj/_weight_quantizer/QuantizeLinear + /patch_embed/proj/Conv.)
When I add the flags --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8, the error changes to:
[08/23/2023-10:40:41] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.
[08/23/2023-10:40:41] [E] Error[1]: [qdqGraphOptimizer.cpp::quantizePaths::3913] Error Code 1: Internal Error (Node /patch_embed/proj/_input_quantizer/QuantizeLinear cannot be quantized by input. You might want to add a DQ node before /patch_embed/proj/_input_quantizer/QuantizeLinear)
Does the onnx model work with onnxruntime? If yes, could you please share a link to the onnx model here? I can help create an internal bug to track it. Thanks!
@zerollzeng Yes, all the .onnx models (the original onnx, the one after onnx simplifier, and the folded model) work in onnxruntime. For example, here is q_model.onnx
Filed internal bug 4259499 for this. Thanks for reporting this!
Issue fixed in TRT 10, closed.
Description
I'm trying to apply the PTQ and QAT procedure to a ViT model using this example notebook: https://github.com/NVIDIA/TensorRT/blob/a167852705d74bcc619d8fad0af4b9e4d84472fc/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb. When I try to convert the model to .trt, I get the error shown below.
Steps To Reproduce
Exporting to ONNX
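Before exporting, the fake-quant ONNX export mode is enabled as in the notebook, so the TensorQuantizer modules are emitted as QuantizeLinear/DequantizeLinear nodes (a sketch of that step, assuming the pytorch-quantization toolkit):

```python
# Enable ONNX-exportable fake quantization (as done in the notebook), so that
# TensorQuantizer modules are exported as QuantizeLinear/DequantizeLinear nodes.
from pytorch_quantization import quant_nn
quant_nn.TensorQuantizer.use_fb_fake_quant = True
```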
```python
dummy_input = torch.randn(1, 3, 224, 224, device='cuda')
torch.onnx.export(
    q_model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    verbose=False,
    opset_version=13,
    do_constant_folding=False)
```
```shell
!trtexec --onnx=model.onnx \
    --explicitBatch \
    --workspace=8000 \
    --saveEngine=model.trt \
    --int8
```
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --int8 [07/17/2023-11:30:46] [W] --explicitBatch flag has been deprecated and has no effect! [07/17/2023-11:30:46] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built. [07/17/2023-11:30:46] [W] --workspace flag has been deprecated by --memPoolSize flag. [07/17/2023-11:30:46] [I] === Model Options === [07/17/2023-11:30:46] [I] Format: ONNX [07/17/2023-11:30:46] [I] Model: model.onnx [07/17/2023-11:30:46] [I] Output: [07/17/2023-11:30:46] [I] === Build Options === [07/17/2023-11:30:46] [I] Max batch: explicit batch [07/17/2023-11:30:46] [I] Memory Pools: workspace: 8000 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default [07/17/2023-11:30:46] [I] minTiming: 1 [07/17/2023-11:30:46] [I] avgTiming: 8 [07/17/2023-11:30:46] [I] Precision: FP32+INT8 [07/17/2023-11:30:46] [I] LayerPrecisions: [07/17/2023-11:30:46] [I] Layer Device Types: [07/17/2023-11:30:46] [I] Calibration: Dynamic [07/17/2023-11:30:46] [I] Refit: Disabled [07/17/2023-11:30:46] [I] Version Compatible: Disabled [07/17/2023-11:30:46] [I] TensorRT runtime: full [07/17/2023-11:30:46] [I] Lean DLL Path: [07/17/2023-11:30:46] [I] Tempfile Controls: { in_memory: allow, temporary: allow } [07/17/2023-11:30:46] [I] Exclude Lean Runtime: Disabled [07/17/2023-11:30:46] [I] Sparsity: Disabled [07/17/2023-11:30:46] [I] Safe mode: Disabled [07/17/2023-11:30:46] [I] Build DLA standalone loadable: Disabled [07/17/2023-11:30:46] [I] Allow GPU fallback for DLA: Disabled [07/17/2023-11:30:46] [I] DirectIO mode: Disabled [07/17/2023-11:30:46] [I] Restricted mode: Disabled [07/17/2023-11:30:46] [I] Skip inference: Disabled [07/17/2023-11:30:46] [I] Save engine: model.trt [07/17/2023-11:30:46] [I] Load engine: [07/17/2023-11:30:46] [I] Profiling verbosity: 0 [07/17/2023-11:30:46] [I] Tactic sources: Using default tactic sources [07/17/2023-11:30:46] [I] timingCacheMode: local [07/17/2023-11:30:46] [I] timingCacheFile: [07/17/2023-11:30:46] [I] Heuristic: Disabled [07/17/2023-11:30:46] [I] Preview Features: Use default preview flags. 
[07/17/2023-11:30:46] [I] MaxAuxStreams: -1 [07/17/2023-11:30:46] [I] BuilderOptimizationLevel: -1 [07/17/2023-11:30:46] [I] Input(s)s format: fp32:CHW [07/17/2023-11:30:46] [I] Output(s)s format: fp32:CHW [07/17/2023-11:30:46] [I] Input build shapes: model [07/17/2023-11:30:46] [I] Input calibration shapes: model [07/17/2023-11:30:46] [I] === System Options === [07/17/2023-11:30:46] [I] Device: 0 [07/17/2023-11:30:46] [I] DLACore: [07/17/2023-11:30:46] [I] Plugins: [07/17/2023-11:30:46] [I] setPluginsToSerialize: [07/17/2023-11:30:46] [I] dynamicPlugins: [07/17/2023-11:30:46] [I] ignoreParsedPluginLibs: 0 [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] === Inference Options === [07/17/2023-11:30:46] [I] Batch: Explicit [07/17/2023-11:30:46] [I] Input inference shapes: model [07/17/2023-11:30:46] [I] Iterations: 10 [07/17/2023-11:30:46] [I] Duration: 3s (+ 200ms warm up) [07/17/2023-11:30:46] [I] Sleep time: 0ms [07/17/2023-11:30:46] [I] Idle time: 0ms [07/17/2023-11:30:46] [I] Inference Streams: 1 [07/17/2023-11:30:46] [I] ExposeDMA: Disabled [07/17/2023-11:30:46] [I] Data transfers: Enabled [07/17/2023-11:30:46] [I] Spin-wait: Disabled [07/17/2023-11:30:46] [I] Multithreading: Disabled [07/17/2023-11:30:46] [I] CUDA Graph: Disabled [07/17/2023-11:30:46] [I] Separate profiling: Disabled [07/17/2023-11:30:46] [I] Time Deserialize: Disabled [07/17/2023-11:30:46] [I] Time Refit: Disabled [07/17/2023-11:30:46] [I] NVTX verbosity: 0 [07/17/2023-11:30:46] [I] Persistent Cache Ratio: 0 [07/17/2023-11:30:46] [I] Inputs: [07/17/2023-11:30:46] [I] === Reporting Options === [07/17/2023-11:30:46] [I] Verbose: Disabled [07/17/2023-11:30:46] [I] Averages: 10 inferences [07/17/2023-11:30:46] [I] Percentiles: 90,95,99 [07/17/2023-11:30:46] [I] Dump refittable layers:Disabled [07/17/2023-11:30:46] [I] Dump output: Disabled [07/17/2023-11:30:46] [I] Profile: Disabled [07/17/2023-11:30:46] [I] Export timing to JSON file: [07/17/2023-11:30:46] [I] Export output to JSON file: [07/17/2023-11:30:46] [I] Export profile to JSON file: [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] === Device Information === [07/17/2023-11:30:46] [I] Selected Device: Tesla T4 [07/17/2023-11:30:46] [I] Compute Capability: 7.5 [07/17/2023-11:30:46] [I] SMs: 40 [07/17/2023-11:30:46] [I] Device Global Memory: 15109 MiB [07/17/2023-11:30:46] [I] Shared Memory per SM: 64 KiB [07/17/2023-11:30:46] [I] Memory Bus Width: 256 bits (ECC enabled) [07/17/2023-11:30:46] [I] Application Compute Clock Rate: 1.59 GHz [07/17/2023-11:30:46] [I] Application Memory Clock Rate: 5.001 GHz [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at. [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] TensorRT version: 8.6.1 [07/17/2023-11:30:46] [I] Loading standard plugins [07/17/2023-11:30:46] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 8588 (MiB) [07/17/2023-11:30:55] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +888, GPU +172, now: CPU 983, GPU 8760 (MiB) [07/17/2023-11:30:55] [I] Start parsing network model. 
[07/17/2023-11:30:56] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-11:30:56] [I] [TRT] Input filename: model.onnx [07/17/2023-11:30:56] [I] [TRT] ONNX IR version: 0.0.7 [07/17/2023-11:30:56] [I] [TRT] Opset version: 13 [07/17/2023-11:30:56] [I] [TRT] Producer name: pytorch [07/17/2023-11:30:56] [I] [TRT] Producer version: 2.1.0 [07/17/2023-11:30:56] [I] [TRT] Domain:
[07/17/2023-11:30:56] [I] [TRT] Model version: 0 [07/17/2023-11:30:56] [I] [TRT] Doc string:
[07/17/2023-11:30:56] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-11:30:56] [W] [TRT] onnx2trt_utils.cpp:514: Your ONNX model has been generated with double-typed weights, while TensorRT does not natively support double. Attempting to cast down to float. [07/17/2023-11:30:56] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32. [07/17/2023-11:30:56] [I] Finished parsing network model. Parse time: 0.932966 [07/17/2023-11:30:56] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best [07/17/2023-11:30:56] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32. [07/17/2023-11:30:56] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes. [07/17/2023-11:30:57] [E] Error[2]: [dims.h::volume::47] Error Code 2: Internal Error (Assertion start <= stop failed. ) [07/17/2023-11:30:57] [E] Engine could not be created from network [07/17/2023-11:30:57] [E] Building engine failed [07/17/2023-11:30:57] [E] Failed to create engine from model or file. [07/17/2023-11:30:57] [E] Engine set up failed &&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --int8
```shell
!trtexec --onnx=model.onnx \
    --explicitBatch \
    --workspace=8000 \
    --saveEngine=model.trt \
    --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8
```
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8 [07/17/2023-12:22:42] [W] --explicitBatch flag has been deprecated and has no effect! [07/17/2023-12:22:42] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built. [07/17/2023-12:22:42] [W] --workspace flag has been deprecated by --memPoolSize flag. [07/17/2023-12:22:42] [I] === Model Options === [07/17/2023-12:22:42] [I] Format: ONNX [07/17/2023-12:22:42] [I] Model: model.onnx [07/17/2023-12:22:42] [I] Output: [07/17/2023-12:22:42] [I] === Build Options === [07/17/2023-12:22:42] [I] Max batch: explicit batch [07/17/2023-12:22:42] [I] Memory Pools: workspace: 8000 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default [07/17/2023-12:22:42] [I] minTiming: 1 [07/17/2023-12:22:42] [I] avgTiming: 8 [07/17/2023-12:22:42] [I] Precision: FP32+INT8 [07/17/2023-12:22:42] [I] LayerPrecisions: [07/17/2023-12:22:42] [I] Layer Device Types: [07/17/2023-12:22:42] [I] Calibration: Dynamic [07/17/2023-12:22:42] [I] Refit: Disabled [07/17/2023-12:22:42] [I] Version Compatible: Disabled [07/17/2023-12:22:42] [I] TensorRT runtime: full [07/17/2023-12:22:42] [I] Lean DLL Path: [07/17/2023-12:22:42] [I] Tempfile Controls: { in_memory: allow, temporary: allow } [07/17/2023-12:22:42] [I] Exclude Lean Runtime: Disabled [07/17/2023-12:22:42] [I] Sparsity: Disabled [07/17/2023-12:22:42] [I] Safe mode: Disabled [07/17/2023-12:22:42] [I] Build DLA standalone loadable: Disabled [07/17/2023-12:22:42] [I] Allow GPU fallback for DLA: Disabled [07/17/2023-12:22:42] [I] DirectIO mode: Disabled [07/17/2023-12:22:42] [I] Restricted mode: Disabled [07/17/2023-12:22:42] [I] Skip inference: Disabled [07/17/2023-12:22:42] [I] Save engine: model.trt [07/17/2023-12:22:42] [I] Load engine: [07/17/2023-12:22:42] [I] Profiling verbosity: 0 [07/17/2023-12:22:42] [I] Tactic sources: Using default tactic sources [07/17/2023-12:22:42] [I] timingCacheMode: local [07/17/2023-12:22:42] [I] timingCacheFile: [07/17/2023-12:22:42] [I] Heuristic: Disabled [07/17/2023-12:22:42] [I] Preview Features: Use default preview flags. 
[07/17/2023-12:22:42] [I] MaxAuxStreams: -1 [07/17/2023-12:22:42] [I] BuilderOptimizationLevel: -1 [07/17/2023-12:22:42] [I] Input(s): int8:chw [07/17/2023-12:22:42] [I] Output(s): int8:chw [07/17/2023-12:22:42] [I] Input build shapes: model [07/17/2023-12:22:42] [I] Input calibration shapes: model [07/17/2023-12:22:42] [I] === System Options === [07/17/2023-12:22:42] [I] Device: 0 [07/17/2023-12:22:42] [I] DLACore: [07/17/2023-12:22:42] [I] Plugins: [07/17/2023-12:22:42] [I] setPluginsToSerialize: [07/17/2023-12:22:42] [I] dynamicPlugins: [07/17/2023-12:22:42] [I] ignoreParsedPluginLibs: 0 [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] === Inference Options === [07/17/2023-12:22:42] [I] Batch: Explicit [07/17/2023-12:22:42] [I] Input inference shapes: model [07/17/2023-12:22:42] [I] Iterations: 10 [07/17/2023-12:22:42] [I] Duration: 3s (+ 200ms warm up) [07/17/2023-12:22:42] [I] Sleep time: 0ms [07/17/2023-12:22:42] [I] Idle time: 0ms [07/17/2023-12:22:42] [I] Inference Streams: 1 [07/17/2023-12:22:42] [I] ExposeDMA: Disabled [07/17/2023-12:22:42] [I] Data transfers: Enabled [07/17/2023-12:22:42] [I] Spin-wait: Disabled [07/17/2023-12:22:42] [I] Multithreading: Disabled [07/17/2023-12:22:42] [I] CUDA Graph: Disabled [07/17/2023-12:22:42] [I] Separate profiling: Disabled [07/17/2023-12:22:42] [I] Time Deserialize: Disabled [07/17/2023-12:22:42] [I] Time Refit: Disabled [07/17/2023-12:22:42] [I] NVTX verbosity: 0 [07/17/2023-12:22:42] [I] Persistent Cache Ratio: 0 [07/17/2023-12:22:42] [I] Inputs: [07/17/2023-12:22:42] [I] === Reporting Options === [07/17/2023-12:22:42] [I] Verbose: Disabled [07/17/2023-12:22:42] [I] Averages: 10 inferences [07/17/2023-12:22:42] [I] Percentiles: 90,95,99 [07/17/2023-12:22:42] [I] Dump refittable layers:Disabled [07/17/2023-12:22:42] [I] Dump output: Disabled [07/17/2023-12:22:42] [I] Profile: Disabled [07/17/2023-12:22:42] [I] Export timing to JSON file: [07/17/2023-12:22:42] [I] Export output to JSON file: [07/17/2023-12:22:42] [I] Export profile to JSON file: [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] === Device Information === [07/17/2023-12:22:42] [I] Selected Device: Tesla T4 [07/17/2023-12:22:42] [I] Compute Capability: 7.5 [07/17/2023-12:22:42] [I] SMs: 40 [07/17/2023-12:22:42] [I] Device Global Memory: 15109 MiB [07/17/2023-12:22:42] [I] Shared Memory per SM: 64 KiB [07/17/2023-12:22:42] [I] Memory Bus Width: 256 bits (ECC enabled) [07/17/2023-12:22:42] [I] Application Compute Clock Rate: 1.59 GHz [07/17/2023-12:22:42] [I] Application Memory Clock Rate: 5.001 GHz [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at. [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] TensorRT version: 8.6.1 [07/17/2023-12:22:42] [I] Loading standard plugins [07/17/2023-12:22:42] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 8588 (MiB) [07/17/2023-12:22:51] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +888, GPU +172, now: CPU 983, GPU 8760 (MiB) [07/17/2023-12:22:51] [I] Start parsing network model. [07/17/2023-12:22:52] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-12:22:52] [I] [TRT] Input filename: model.onnx [07/17/2023-12:22:52] [I] [TRT] ONNX IR version: 0.0.7 [07/17/2023-12:22:52] [I] [TRT] Opset version: 13 [07/17/2023-12:22:52] [I] [TRT] Producer name: pytorch [07/17/2023-12:22:52] [I] [TRT] Producer version: 2.1.0 [07/17/2023-12:22:52] [I] [TRT] Domain:
[07/17/2023-12:22:52] [I] [TRT] Model version: 0 [07/17/2023-12:22:52] [I] [TRT] Doc string:
[07/17/2023-12:22:52] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-12:22:52] [W] [TRT] onnx2trt_utils.cpp:514: Your ONNX model has been generated with double-typed weights, while TensorRT does not natively support double. Attempting to cast down to float. [07/17/2023-12:22:52] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32. [07/17/2023-12:22:52] [I] Finished parsing network model. Parse time: 0.918563 [07/17/2023-12:22:52] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best [07/17/2023-12:22:52] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32. [07/17/2023-12:22:52] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes. [07/17/2023-12:22:52] [E] Error[1]: [qdqGraphOptimizer.cpp::quantizePaths::3913] Error Code 1: Internal Error (Node /patch_embed/proj/_input_quantizer/QuantizeLinear cannot be quantized by input. You might want to add a DQ node before /patch_embed/proj/_input_quantizer/QuantizeLinear ) [07/17/2023-12:22:52] [E] Engine could not be created from network [07/17/2023-12:22:52] [E] Building engine failed [07/17/2023-12:22:52] [E] Failed to create engine from model or file. [07/17/2023-12:22:52] [E] Engine set up failed &&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored [I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt [I] onnxrt-runner-N0-07/17/23-12:24:47 | Activating and starting inference [I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider'] [I] onnxrt-runner-N0-07/17/23-12:24:47 ---- Inference Input(s) ---- {input [dtype=float32, shape=(1, 3, 224, 224)]} [I] onnxrt-runner-N0-07/17/23-12:24:47 ---- Inference Output(s) ---- {output [dtype=float32, shape=(1, 6)]} [I] onnxrt-runner-N0-07/17/23-12:24:47 | Completed 1 iteration(s) in 28.23 ms | Average inference time: 28.23 ms. [I] PASSED | Runtime: 1.435s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt