I feel like the error is in the Q/DQ placement, @ttyio what do you think?
@proevgenii Just want to double confirm: you make use of https://github.com/NVIDIA/TensorRT/blob/a167852705d74bcc619d8fad0af4b9e4d84472fc/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb for your own ViT model, am I correct?
Yes, that's right, I am using this tutorial to quantize my pre-trained ViT model from timm.
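Roughly, the setup looks like this (a minimal sketch, not the exact code I ran; the model name and num_classes=6 are illustrative, the latter taken from the (1, 6) output shape in the logs below):

```python
# Minimal sketch of the setup (assumptions: pytorch-quantization toolkit from the
# notebook, a timm ViT; the model name and num_classes here are illustrative).
import timm
from pytorch_quantization import quant_modules

# Swap torch.nn modules for their quantized counterparts before the model is built,
# so Conv/Linear layers carry TensorQuantizer (fake Q/DQ) modules.
quant_modules.initialize()

q_model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=6)
q_model = q_model.cuda().eval()
# ...calibration / QAT fine-tuning as in the notebook...
```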
@ttyio ^ ^
@proevgenii Do you have the TRT verbose build log? I want to check which operator failed. Is it the slice layer? Thanks!
@ttyio Yep, here are two log files: log.txt, log_outputIOFormats.txt
@proevgenii, I did not see Internal Error (Assertion start <= stop failed. ) in your log?
@ttyio Yeah, but here is a screenshot of how I got this log.txt, and the command in the cell also produced output which contains
Internal Error (Assertion start <= stop failed. )
What about --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8? Should I use it or not, and why does it give me a different error?
@ttyio Any updates?
@proevgenii Sorry for the delayed response, I cannot tell why from the log and screenshot. Could you also share the onnx file? Thanks!
--inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8
This tells TensorRT to use the int8 datatype and CHW layout for the network input and output, which saves some reformatting at the network boundary (you then need to interpret that int8 data outside of TRT yourself; this is useful when other parts of your workflow accept int8 as input/output). By doing this you may change the precision used for the layers near the network output tensors. I need to check your onnx to see the detailed reason.
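For example, handling int8 I/O outside of TRT would look roughly like this (a sketch; the scales are placeholders that in practice come from the Q/DQ nodes at the network boundary):

```python
# Rough sketch of quantizing/dequantizing int8 engine bindings outside TensorRT;
# input_scale/output_scale are placeholders taken from the boundary Q/DQ nodes.
import numpy as np

def to_int8_chw(x_fp32: np.ndarray, input_scale: float) -> np.ndarray:
    # Quantize an FP32 CHW tensor into the int8 values the engine expects at its input.
    q = np.round(x_fp32 / input_scale)
    return np.clip(q, -128, 127).astype(np.int8)

def from_int8(y_int8: np.ndarray, output_scale: float) -> np.ndarray:
    # Dequantize the engine's int8 output back to FP32.
    return y_int8.astype(np.float32) * output_scale
```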
@ttyio Here's the onnx model
@proevgenii, could you try a workaround (WAR) by folding constants first? Thanks!
polygraphy surgeon sanitize q_model.onnx --fold-constants --output q_model_folded.onnx
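After folding, you can sanity-check that the folded model still runs with onnxruntime before feeding it to trtexec (a quick sketch, assuming the input tensor is named "input" as in your export code):

```python
# Quick sanity check of the folded model (assumes onnxruntime is installed and the
# ONNX input tensor is named "input").
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("q_model_folded.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(out[0].shape)  # expect (1, 6) for this model
```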
@ttyio It looks like it simplifies the onnx model graph, similar to onnx_simplifier, which I also tried.
But I ran this:
polygraphy surgeon sanitize q_model.onnx --fold-constants --output q_model_folded.onnx
Then I ran trtexec, which failed with this error:
[08/23/2023-10:32:14] [E] Error[10]: Could not find any implementation for node patch_embed.proj.weight + /patch_embed/proj/_weight_quantizer/QuantizeLinear + /patch_embed/proj/Conv.
[08/23/2023-10:32:14] [E] Error[10]: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node patch_embed.proj.weight + /patch_embed/proj/_weight_quantizer/QuantizeLinear + /patch_embed/proj/Conv.)
When I add the flags --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8, the error changes to:
[08/23/2023-10:40:41] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.
[08/23/2023-10:40:41] [E] Error[1]: [qdqGraphOptimizer.cpp::quantizePaths::3913] Error Code 1: Internal Error (Node /patch_embed/proj/_input_quantizer/QuantizeLinear cannot be quantized by input. You might want to add a DQ node before /patch_embed/proj/_input_quantizer/QuantizeLinear)
Does the onnx model work with onnxruntime? If yes, could you please share a link to the onnx model here? I can help create an internal bug to track it. Thanks!
@zerollzeng Yes, all the .onnx models (the original onnx, the one after onnx simplifier, and the folded model) work in onnxruntime. For example, here is q_model.onnx
Filed internal bug 4259499 for this. Thanks for reporting this!
Issue fixed in TRT 10, closed.
Description
I'm trying to apply the PTQ and QAT procedure to a ViT model using this example notebook: https://github.com/NVIDIA/TensorRT/blob/a167852705d74bcc619d8fad0af4b9e4d84472fc/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb. When I try to convert the model to .trt, I get the error shown below.
Steps To Reproduce
Exporting to ONNX
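Before exporting, the fake-quant ONNX export mode is enabled as in the notebook, so the TensorQuantizer modules are emitted as QuantizeLinear/DequantizeLinear nodes (a sketch of that step, assuming the pytorch-quantization toolkit):

```python
# Enable ONNX-exportable fake quantization (as done in the notebook), so that
# TensorQuantizer modules are exported as QuantizeLinear/DequantizeLinear nodes.
from pytorch_quantization import quant_nn
quant_nn.TensorQuantizer.use_fb_fake_quant = True
```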
```python
dummy_input = torch.randn(1, 3, 224, 224, device='cuda')
torch.onnx.export(
    q_model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    verbose=False,
    opset_version=13,
    do_constant_folding=False)
```
```shell
!trtexec --onnx=model.onnx \
    --explicitBatch \
    --workspace=8000 \
    --saveEngine=model.trt \
    --int8
```
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --int8 [07/17/2023-11:30:46] [W] --explicitBatch flag has been deprecated and has no effect! [07/17/2023-11:30:46] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built. [07/17/2023-11:30:46] [W] --workspace flag has been deprecated by --memPoolSize flag. [07/17/2023-11:30:46] [I] === Model Options === [07/17/2023-11:30:46] [I] Format: ONNX [07/17/2023-11:30:46] [I] Model: model.onnx [07/17/2023-11:30:46] [I] Output: [07/17/2023-11:30:46] [I] === Build Options === [07/17/2023-11:30:46] [I] Max batch: explicit batch [07/17/2023-11:30:46] [I] Memory Pools: workspace: 8000 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default [07/17/2023-11:30:46] [I] minTiming: 1 [07/17/2023-11:30:46] [I] avgTiming: 8 [07/17/2023-11:30:46] [I] Precision: FP32+INT8 [07/17/2023-11:30:46] [I] LayerPrecisions: [07/17/2023-11:30:46] [I] Layer Device Types: [07/17/2023-11:30:46] [I] Calibration: Dynamic [07/17/2023-11:30:46] [I] Refit: Disabled [07/17/2023-11:30:46] [I] Version Compatible: Disabled [07/17/2023-11:30:46] [I] TensorRT runtime: full [07/17/2023-11:30:46] [I] Lean DLL Path: [07/17/2023-11:30:46] [I] Tempfile Controls: { in_memory: allow, temporary: allow } [07/17/2023-11:30:46] [I] Exclude Lean Runtime: Disabled [07/17/2023-11:30:46] [I] Sparsity: Disabled [07/17/2023-11:30:46] [I] Safe mode: Disabled [07/17/2023-11:30:46] [I] Build DLA standalone loadable: Disabled [07/17/2023-11:30:46] [I] Allow GPU fallback for DLA: Disabled [07/17/2023-11:30:46] [I] DirectIO mode: Disabled [07/17/2023-11:30:46] [I] Restricted mode: Disabled [07/17/2023-11:30:46] [I] Skip inference: Disabled [07/17/2023-11:30:46] [I] Save engine: model.trt [07/17/2023-11:30:46] [I] Load engine: [07/17/2023-11:30:46] [I] Profiling verbosity: 0 [07/17/2023-11:30:46] [I] Tactic sources: Using default tactic sources [07/17/2023-11:30:46] [I] timingCacheMode: local [07/17/2023-11:30:46] [I] timingCacheFile: [07/17/2023-11:30:46] [I] Heuristic: Disabled [07/17/2023-11:30:46] [I] Preview Features: Use default preview flags. 
[07/17/2023-11:30:46] [I] MaxAuxStreams: -1 [07/17/2023-11:30:46] [I] BuilderOptimizationLevel: -1 [07/17/2023-11:30:46] [I] Input(s)s format: fp32:CHW [07/17/2023-11:30:46] [I] Output(s)s format: fp32:CHW [07/17/2023-11:30:46] [I] Input build shapes: model [07/17/2023-11:30:46] [I] Input calibration shapes: model [07/17/2023-11:30:46] [I] === System Options === [07/17/2023-11:30:46] [I] Device: 0 [07/17/2023-11:30:46] [I] DLACore: [07/17/2023-11:30:46] [I] Plugins: [07/17/2023-11:30:46] [I] setPluginsToSerialize: [07/17/2023-11:30:46] [I] dynamicPlugins: [07/17/2023-11:30:46] [I] ignoreParsedPluginLibs: 0 [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] === Inference Options === [07/17/2023-11:30:46] [I] Batch: Explicit [07/17/2023-11:30:46] [I] Input inference shapes: model [07/17/2023-11:30:46] [I] Iterations: 10 [07/17/2023-11:30:46] [I] Duration: 3s (+ 200ms warm up) [07/17/2023-11:30:46] [I] Sleep time: 0ms [07/17/2023-11:30:46] [I] Idle time: 0ms [07/17/2023-11:30:46] [I] Inference Streams: 1 [07/17/2023-11:30:46] [I] ExposeDMA: Disabled [07/17/2023-11:30:46] [I] Data transfers: Enabled [07/17/2023-11:30:46] [I] Spin-wait: Disabled [07/17/2023-11:30:46] [I] Multithreading: Disabled [07/17/2023-11:30:46] [I] CUDA Graph: Disabled [07/17/2023-11:30:46] [I] Separate profiling: Disabled [07/17/2023-11:30:46] [I] Time Deserialize: Disabled [07/17/2023-11:30:46] [I] Time Refit: Disabled [07/17/2023-11:30:46] [I] NVTX verbosity: 0 [07/17/2023-11:30:46] [I] Persistent Cache Ratio: 0 [07/17/2023-11:30:46] [I] Inputs: [07/17/2023-11:30:46] [I] === Reporting Options === [07/17/2023-11:30:46] [I] Verbose: Disabled [07/17/2023-11:30:46] [I] Averages: 10 inferences [07/17/2023-11:30:46] [I] Percentiles: 90,95,99 [07/17/2023-11:30:46] [I] Dump refittable layers:Disabled [07/17/2023-11:30:46] [I] Dump output: Disabled [07/17/2023-11:30:46] [I] Profile: Disabled [07/17/2023-11:30:46] [I] Export timing to JSON file: [07/17/2023-11:30:46] [I] Export output to JSON file: [07/17/2023-11:30:46] [I] Export profile to JSON file: [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] === Device Information === [07/17/2023-11:30:46] [I] Selected Device: Tesla T4 [07/17/2023-11:30:46] [I] Compute Capability: 7.5 [07/17/2023-11:30:46] [I] SMs: 40 [07/17/2023-11:30:46] [I] Device Global Memory: 15109 MiB [07/17/2023-11:30:46] [I] Shared Memory per SM: 64 KiB [07/17/2023-11:30:46] [I] Memory Bus Width: 256 bits (ECC enabled) [07/17/2023-11:30:46] [I] Application Compute Clock Rate: 1.59 GHz [07/17/2023-11:30:46] [I] Application Memory Clock Rate: 5.001 GHz [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at. [07/17/2023-11:30:46] [I] [07/17/2023-11:30:46] [I] TensorRT version: 8.6.1 [07/17/2023-11:30:46] [I] Loading standard plugins [07/17/2023-11:30:46] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 8588 (MiB) [07/17/2023-11:30:55] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +888, GPU +172, now: CPU 983, GPU 8760 (MiB) [07/17/2023-11:30:55] [I] Start parsing network model. 
[07/17/2023-11:30:56] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-11:30:56] [I] [TRT] Input filename: model.onnx [07/17/2023-11:30:56] [I] [TRT] ONNX IR version: 0.0.7 [07/17/2023-11:30:56] [I] [TRT] Opset version: 13 [07/17/2023-11:30:56] [I] [TRT] Producer name: pytorch [07/17/2023-11:30:56] [I] [TRT] Producer version: 2.1.0 [07/17/2023-11:30:56] [I] [TRT] Domain:
[07/17/2023-11:30:56] [I] [TRT] Model version: 0 [07/17/2023-11:30:56] [I] [TRT] Doc string:
[07/17/2023-11:30:56] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-11:30:56] [W] [TRT] onnx2trt_utils.cpp:514: Your ONNX model has been generated with double-typed weights, while TensorRT does not natively support double. Attempting to cast down to float. [07/17/2023-11:30:56] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32. [07/17/2023-11:30:56] [I] Finished parsing network model. Parse time: 0.932966 [07/17/2023-11:30:56] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best [07/17/2023-11:30:56] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32. [07/17/2023-11:30:56] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes. [07/17/2023-11:30:57] [E] Error[2]: [dims.h::volume::47] Error Code 2: Internal Error (Assertion start <= stop failed. ) [07/17/2023-11:30:57] [E] Engine could not be created from network [07/17/2023-11:30:57] [E] Building engine failed [07/17/2023-11:30:57] [E] Failed to create engine from model or file. [07/17/2023-11:30:57] [E] Engine set up failed &&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --int8
```shell
!trtexec --onnx=model.onnx \
    --explicitBatch \
    --workspace=8000 \
    --saveEngine=model.trt \
    --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8
```
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8 [07/17/2023-12:22:42] [W] --explicitBatch flag has been deprecated and has no effect! [07/17/2023-12:22:42] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built. [07/17/2023-12:22:42] [W] --workspace flag has been deprecated by --memPoolSize flag. [07/17/2023-12:22:42] [I] === Model Options === [07/17/2023-12:22:42] [I] Format: ONNX [07/17/2023-12:22:42] [I] Model: model.onnx [07/17/2023-12:22:42] [I] Output: [07/17/2023-12:22:42] [I] === Build Options === [07/17/2023-12:22:42] [I] Max batch: explicit batch [07/17/2023-12:22:42] [I] Memory Pools: workspace: 8000 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default [07/17/2023-12:22:42] [I] minTiming: 1 [07/17/2023-12:22:42] [I] avgTiming: 8 [07/17/2023-12:22:42] [I] Precision: FP32+INT8 [07/17/2023-12:22:42] [I] LayerPrecisions: [07/17/2023-12:22:42] [I] Layer Device Types: [07/17/2023-12:22:42] [I] Calibration: Dynamic [07/17/2023-12:22:42] [I] Refit: Disabled [07/17/2023-12:22:42] [I] Version Compatible: Disabled [07/17/2023-12:22:42] [I] TensorRT runtime: full [07/17/2023-12:22:42] [I] Lean DLL Path: [07/17/2023-12:22:42] [I] Tempfile Controls: { in_memory: allow, temporary: allow } [07/17/2023-12:22:42] [I] Exclude Lean Runtime: Disabled [07/17/2023-12:22:42] [I] Sparsity: Disabled [07/17/2023-12:22:42] [I] Safe mode: Disabled [07/17/2023-12:22:42] [I] Build DLA standalone loadable: Disabled [07/17/2023-12:22:42] [I] Allow GPU fallback for DLA: Disabled [07/17/2023-12:22:42] [I] DirectIO mode: Disabled [07/17/2023-12:22:42] [I] Restricted mode: Disabled [07/17/2023-12:22:42] [I] Skip inference: Disabled [07/17/2023-12:22:42] [I] Save engine: model.trt [07/17/2023-12:22:42] [I] Load engine: [07/17/2023-12:22:42] [I] Profiling verbosity: 0 [07/17/2023-12:22:42] [I] Tactic sources: Using default tactic sources [07/17/2023-12:22:42] [I] timingCacheMode: local [07/17/2023-12:22:42] [I] timingCacheFile: [07/17/2023-12:22:42] [I] Heuristic: Disabled [07/17/2023-12:22:42] [I] Preview Features: Use default preview flags. 
[07/17/2023-12:22:42] [I] MaxAuxStreams: -1 [07/17/2023-12:22:42] [I] BuilderOptimizationLevel: -1 [07/17/2023-12:22:42] [I] Input(s): int8:chw [07/17/2023-12:22:42] [I] Output(s): int8:chw [07/17/2023-12:22:42] [I] Input build shapes: model [07/17/2023-12:22:42] [I] Input calibration shapes: model [07/17/2023-12:22:42] [I] === System Options === [07/17/2023-12:22:42] [I] Device: 0 [07/17/2023-12:22:42] [I] DLACore: [07/17/2023-12:22:42] [I] Plugins: [07/17/2023-12:22:42] [I] setPluginsToSerialize: [07/17/2023-12:22:42] [I] dynamicPlugins: [07/17/2023-12:22:42] [I] ignoreParsedPluginLibs: 0 [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] === Inference Options === [07/17/2023-12:22:42] [I] Batch: Explicit [07/17/2023-12:22:42] [I] Input inference shapes: model [07/17/2023-12:22:42] [I] Iterations: 10 [07/17/2023-12:22:42] [I] Duration: 3s (+ 200ms warm up) [07/17/2023-12:22:42] [I] Sleep time: 0ms [07/17/2023-12:22:42] [I] Idle time: 0ms [07/17/2023-12:22:42] [I] Inference Streams: 1 [07/17/2023-12:22:42] [I] ExposeDMA: Disabled [07/17/2023-12:22:42] [I] Data transfers: Enabled [07/17/2023-12:22:42] [I] Spin-wait: Disabled [07/17/2023-12:22:42] [I] Multithreading: Disabled [07/17/2023-12:22:42] [I] CUDA Graph: Disabled [07/17/2023-12:22:42] [I] Separate profiling: Disabled [07/17/2023-12:22:42] [I] Time Deserialize: Disabled [07/17/2023-12:22:42] [I] Time Refit: Disabled [07/17/2023-12:22:42] [I] NVTX verbosity: 0 [07/17/2023-12:22:42] [I] Persistent Cache Ratio: 0 [07/17/2023-12:22:42] [I] Inputs: [07/17/2023-12:22:42] [I] === Reporting Options === [07/17/2023-12:22:42] [I] Verbose: Disabled [07/17/2023-12:22:42] [I] Averages: 10 inferences [07/17/2023-12:22:42] [I] Percentiles: 90,95,99 [07/17/2023-12:22:42] [I] Dump refittable layers:Disabled [07/17/2023-12:22:42] [I] Dump output: Disabled [07/17/2023-12:22:42] [I] Profile: Disabled [07/17/2023-12:22:42] [I] Export timing to JSON file: [07/17/2023-12:22:42] [I] Export output to JSON file: [07/17/2023-12:22:42] [I] Export profile to JSON file: [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] === Device Information === [07/17/2023-12:22:42] [I] Selected Device: Tesla T4 [07/17/2023-12:22:42] [I] Compute Capability: 7.5 [07/17/2023-12:22:42] [I] SMs: 40 [07/17/2023-12:22:42] [I] Device Global Memory: 15109 MiB [07/17/2023-12:22:42] [I] Shared Memory per SM: 64 KiB [07/17/2023-12:22:42] [I] Memory Bus Width: 256 bits (ECC enabled) [07/17/2023-12:22:42] [I] Application Compute Clock Rate: 1.59 GHz [07/17/2023-12:22:42] [I] Application Memory Clock Rate: 5.001 GHz [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at. [07/17/2023-12:22:42] [I] [07/17/2023-12:22:42] [I] TensorRT version: 8.6.1 [07/17/2023-12:22:42] [I] Loading standard plugins [07/17/2023-12:22:42] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 8588 (MiB) [07/17/2023-12:22:51] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +888, GPU +172, now: CPU 983, GPU 8760 (MiB) [07/17/2023-12:22:51] [I] Start parsing network model. [07/17/2023-12:22:52] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-12:22:52] [I] [TRT] Input filename: model.onnx [07/17/2023-12:22:52] [I] [TRT] ONNX IR version: 0.0.7 [07/17/2023-12:22:52] [I] [TRT] Opset version: 13 [07/17/2023-12:22:52] [I] [TRT] Producer name: pytorch [07/17/2023-12:22:52] [I] [TRT] Producer version: 2.1.0 [07/17/2023-12:22:52] [I] [TRT] Domain:
[07/17/2023-12:22:52] [I] [TRT] Model version: 0 [07/17/2023-12:22:52] [I] [TRT] Doc string:
[07/17/2023-12:22:52] [I] [TRT] ---------------------------------------------------------------- [07/17/2023-12:22:52] [W] [TRT] onnx2trt_utils.cpp:514: Your ONNX model has been generated with double-typed weights, while TensorRT does not natively support double. Attempting to cast down to float. [07/17/2023-12:22:52] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32. [07/17/2023-12:22:52] [I] Finished parsing network model. Parse time: 0.918563 [07/17/2023-12:22:52] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best [07/17/2023-12:22:52] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32. [07/17/2023-12:22:52] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes. [07/17/2023-12:22:52] [E] Error[1]: [qdqGraphOptimizer.cpp::quantizePaths::3913] Error Code 1: Internal Error (Node /patch_embed/proj/_input_quantizer/QuantizeLinear cannot be quantized by input. You might want to add a DQ node before /patch_embed/proj/_input_quantizer/QuantizeLinear ) [07/17/2023-12:22:52] [E] Engine could not be created from network [07/17/2023-12:22:52] [E] Building engine failed [07/17/2023-12:22:52] [E] Failed to create engine from model or file. [07/17/2023-12:22:52] [E] Engine set up failed &&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx --explicitBatch --workspace=8000 --saveEngine=model.trt --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored [I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt [I] onnxrt-runner-N0-07/17/23-12:24:47 | Activating and starting inference [I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider'] [I] onnxrt-runner-N0-07/17/23-12:24:47 ---- Inference Input(s) ---- {input [dtype=float32, shape=(1, 3, 224, 224)]} [I] onnxrt-runner-N0-07/17/23-12:24:47 ---- Inference Output(s) ---- {output [dtype=float32, shape=(1, 6)]} [I] onnxrt-runner-N0-07/17/23-12:24:47 | Completed 1 iteration(s) in 28.23 ms | Average inference time: 28.23 ms. [I] PASSED | Runtime: 1.435s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt