NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Error when exporting TRT model from the quantized ONNX #73

Closed · DataXujing closed 2 months ago

DataXujing commented 2 months ago

[09/13/2024-21:22:43] [V] [TRT] Graph optimization time: 2.43234 seconds.
[09/13/2024-21:22:43] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[09/13/2024-21:22:43] [V] [TRT] Building graph using backend strategy 2
[09/13/2024-21:22:43] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[09/13/2024-21:22:43] [V] [TRT] Constructing optimization profile number 0 [1/1].
[09/13/2024-21:22:43] [V] [TRT] Applying generic optimizations to the graph for inference.
[09/13/2024-21:22:43] [V] [TRT] Reserving memory for host IO tensors. Host: 0 bytes
[09/13/2024-21:22:43] [V] [TRT] =============== Computing costs for model.0.conv.weight + model.0.conv.weight_QuantizeLinear + /model.0/conv/Conv + PWN(/model.0/act/Sigmoid, /model.0/act/Mul)
[09/13/2024-21:22:43] [V] [TRT] Autotuning format combination: Int8(1228800,409600,640,1) -> Int8(1638400,102400,320,1)
[09/13/2024-21:22:43] [V] [TRT] Skipping CaskConvolution: No valid tactics for model.0.conv.weight + model.0.conv.weight_QuantizeLinear + /model.0/conv/Conv + PWN(/model.0/act/Sigmoid, /model.0/act/Mul)
[09/13/2024-21:22:43] [V] [TRT] Skipping CaskFlattenConvolution: No valid tactics for model.0.conv.weight + model.0.conv.weight_QuantizeLinear + /model.0/conv/Conv + PWN(/model.0/act/Sigmoid, /model.0/act/Mul)
[09/13/2024-21:22:43] [V] [TRT] Autotuning format combination: Int8(1228800,409600,640,1) -> Int8(102400,102400:32,320,1)
[09/13/2024-21:22:43] [V] [TRT] Skipping CaskConvolution: No valid tactics for model.0.conv.weight + model.0.conv.weight_QuantizeLinear + /model.0/conv/Conv + PWN(/model.0/act/Sigmoid, /model.0/act/Mul)
[09/13/2024-21:22:43] [V] [TRT] Skipping CaskFlattenConvolution: No valid tactics for model.0.conv.weight + model.0.conv.weight_QuantizeLinear + /model.0/conv/Conv + PWN(/model.0/act/Sigmoid, /model.0/act/Mul)
[09/13/2024-21:22:43] [V] [TRT] Autotuning format combination: Int8(409600,409600:4,640,1) -> Int8(409600,102400:4,320,1)
[09/13/2024-21:22:43] [E] Error[2]: [weightsPtr.h::nvinfer1::WeightsPtr::values::182] Error Code 2: Internal Error (Assertion type() == expectedDataType() failed. )
[09/13/2024-21:22:43] [E] Engine could not be created from network
[09/13/2024-21:22:43] [E] Building engine failed
[09/13/2024-21:22:43] [E] Failed to create engine from model or file.
[09/13/2024-21:22:43] [E] Engine set up failed

riyadshairi979 commented 2 months ago

Follow the example here to quantize the model and then compile it with trtexec or the TensorRT Python APIs (a sketch of both steps is below). If the problem persists, please share the input ONNX model and the commands to reproduce the issue.
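A minimal sketch of that two-step workflow, assuming the `modelopt.onnx.quantization.quantize` entry point with its `onnx_path`/`quantize_mode`/`calibration_data`/`output_path` parameters as documented for recent ModelOpt releases (the file names are hypothetical placeholders):

```python
import numpy as np
import tensorrt as trt
from modelopt.onnx.quantization import quantize

# Step 1: insert Q/DQ nodes into the ONNX model (INT8 PTQ).
# "yolo.onnx" and "calib.npy" are placeholder names; the calibration
# array should match the model's input shape.
quantize(
    onnx_path="yolo.onnx",
    quantize_mode="int8",
    calibration_data=np.load("calib.npy"),
    output_path="yolo.quant.onnx",
)

# Step 2: build a TensorRT engine from the quantized ONNX via the Python API.
logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required on TensorRT 8.x; on TensorRT 10 explicit batch
# is already the default and the flag is deprecated.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("yolo.quant.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # scales come from the Q/DQ nodes; no calibrator needed
engine = builder.build_serialized_network(network, config)
if engine is None:
    raise RuntimeError("engine build failed")
with open("yolo.engine", "wb") as f:
    f.write(engine)
```

The equivalent command-line build would be `trtexec --onnx=yolo.quant.onnx --int8 --saveEngine=yolo.engine`.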

LaVieEnRoseSMZ commented 2 months ago

I see the difference now: the RTX A6000 is Ampere, but the RTX 6000 (Ada generation) is Ada. Thanks for your great work. And it is now feasible to run FP8 for flux-dev on a small-VRAM GPU like the 24 GB L4.
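For context, FP8 post-training quantization with ModelOpt's PyTorch API looks roughly like the sketch below. The toy module is a stand-in for the flux-dev transformer (any nn.Module is handled the same way); `mtq.FP8_DEFAULT_CFG` and the `forward_loop` calibration hook follow the documented `modelopt.torch.quantization` interface, but this is not a full flux pipeline:

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-in for the flux-dev transformer; in practice you would load
# the real model (e.g., via diffusers) and quantize it the same way.
model = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64)).cuda()

def forward_loop(m):
    # Calibration pass: run representative inputs so activation ranges
    # (amax values) can be collected for the FP8 scales.
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(4, 64, device="cuda"))

# Quantize weights and activations to FP8 using the predefined config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

Hardware-wise this matches the comment above: FP8 needs fourth-generation Tensor Cores (Ada or Hopper), which the L4 and the RTX 6000 Ada have but the Ampere RTX A6000 does not.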