NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Cannot build TRT engine after ONNX INT8 quantization #38

Open DefTruth opened 4 months ago

DefTruth commented 4 months ago

```
/root/anaconda3/envs/modelopt/lib/python3.10/site-packages/modelopt/onnx/quantization/int4.py:27: UserWarning: Using slower INT4 ONNX quantization using numpy. Install JAX (https://jax.readthedocs.io/en/latest/installation.html) for faster quantization: jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.
  warnings.warn(
Loading extension modelopt_round_and_pack_ext...
```

```
INFO:root:Model encoder-vd-512-10-skip-mha.onnx with opset_version 17 is loaded.
INFO:root:Quantization Mode: int8
INFO:root:Quantizable op types in the model: ['Add', 'AveragePool', 'Mul', 'Conv']
INFO:root:Building non-residual Add input map ...
INFO:root:Searching for hard-coded patterns like MHA, LayerNorm, etc. to avoid quantization.
INFO:root:Building KGEN/CASK targeted partitions ...
INFO:root:Classifying the partition nodes ...
INFO:root:Total number of nodes: 507
INFO:root:Skipped node count: 0
WARNING:root:Please consider to run pre-processing before quantization. Refer to example: https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/ReadMe.md
Collecting tensor data and making histogram ...
100%|██████████| 286/286 [03:23<00:00, 1.41it/s]
Finding optimal threshold for each tensor using 'entropy' algorithm ...
Number of tensors : 286
Number of histogram bins : 128 (The number may increase depends on the data it collects)
Number of quantized bins : 128
WARNING:root:Please consider pre-processing before quantization. See https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/ReadMe.md
INFO:root:Deleting QDQ nodes from marked inputs to make certain operations fusible ...
INFO:root:Quantized onnx model is saved as encoder-w8a8-int8.onnx
INFO:root:Total number of quantized nodes: 166
INFO:root:Quantized node types: {'Add', 'AveragePool', 'Sigmoid', 'Reshape', 'Mul', 'Shape', 'Conv'}
```
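For context, a run producing a log like the one above would presumably be driven by the modelopt ONNX quantization CLI. A minimal sketch that assembles such a command (the `--onnx_path`/`--quantize_mode`/`--output_path` flag names are an assumption based on the Model Optimizer README; the calibration-data path is a placeholder):

```python
import shlex

def quantize_cmd(onnx_path, output_path, calib_data=None, op_types_to_exclude=()):
    """Assemble the (assumed) modelopt ONNX INT8 quantization command line."""
    cmd = [
        "python", "-m", "modelopt.onnx.quantization",
        f"--onnx_path={onnx_path}",
        "--quantize_mode=int8",
        f"--output_path={output_path}",
    ]
    if calib_data:  # placeholder; real runs supply recorded calibration tensors
        cmd.append(f"--calibration_data={calib_data}")
    for op in op_types_to_exclude:  # exclusion flag used later in this thread
        cmd += ["--op_types_to_exclude", op]
    return cmd

print(shlex.join(quantize_cmd("encoder-vd-512-10-skip-mha.onnx",
                              "encoder-w8a8-int8.onnx")))
```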


- build engine (FAILED)
```bash
trtexec --onnx=encoder-w8a8-int8.onnx --saveEngine=encoder.w8a8.int8.engine --memPoolSize=workspace:40G --stronglyTyped
[07/11/2024-17:10:11] [E] Error[2]: Error Code: 2: Assertion static_cast<size_t>(c) < mSet.size() failed.
[07/11/2024-17:10:11] [E] Error[2]: [cgraph.h::assertIsValidSubscript::161] Error Code 2: Internal Error (Assertion static_cast<size_t>(c) < mSet.size() failed. )
[07/11/2024-17:10:11] [E] Engine could not be created from network
[07/11/2024-17:10:11] [E] Building engine failed
```

modelopt 0.13.1, TRT 10.1.0

DefTruth commented 4 months ago

After adding `--nodes_to_exclude "AveragePool" --op_types_to_exclude "AveragePool"`, the engine builds successfully.
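The same workaround through the Python API, as a hedged sketch: it assumes `modelopt.onnx.quantization.quantize` accepts keyword names mirroring the CLI flags, and is guarded so it is a no-op when modelopt or the model file is absent.

```python
import os

try:
    from modelopt.onnx.quantization import quantize  # assumed entry point
except ImportError:
    quantize = None  # modelopt not installed; keep the sketch importable

# Keyword names mirror the CLI flags used in this thread; excluding
# AveragePool is the workaround that made the engine build pass.
kwargs = dict(
    onnx_path="encoder-vd-512-10-skip-mha.onnx",
    quantize_mode="int8",
    op_types_to_exclude=["AveragePool"],
    output_path="encoder-w8a8-int8.onnx",
)

if quantize is not None and os.path.exists(kwargs["onnx_path"]):
    quantize(**kwargs)
```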

DefTruth commented 4 months ago

Also, the CPU runs out of memory (OOM) during calibration when the batch size is large.
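One mitigation worth noting (not confirmed in this thread): the log's "Collecting tensor data and making histogram" step appears to accumulate activations on the host, so feeding the calibration set in smaller batches can bound peak memory. A generic batching helper:

```python
def batched(samples, batch_size):
    """Yield calibration samples in fixed-size batches so host memory
    stays bounded instead of growing with the whole calibration set."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

# e.g. run calibration once per small batch instead of one huge batch
for batch in batched(list(range(10)), batch_size=4):
    pass  # feed `batch` to the calibrator here
```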

DefTruth commented 4 months ago

Another model raises a new error after ONNX quantization:

```
[07/12/2024-18:29:08] [I] Finished parsing network model. Parse time: 0.633489
[07/12/2024-18:29:09] [W] [TRT] Calibrator won't be used in explicit quantization mode. Please insert Quantize/Dequantize layers to indicate which tensors to quantize/dequantize.
[07/12/2024-18:29:10] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/12/2024-18:33:45] [E] Error[10]: Error Code: 10: Could not find any implementation for node /Concat_17slice.
[07/12/2024-18:33:45] [E] Error[10]: IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node /Concat_17slice.)
[07/12/2024-18:33:45] [E] Engine could not be created from network
[07/12/2024-18:33:45] [E] Building engine failed
[07/12/2024-18:33:45] [E] Failed to create engine from model or file.
[07/12/2024-18:33:45] [E] Engine set up failed
```

cjluo-omniml commented 4 months ago

This is likely a TRT bug. If there is a way to reproduce it, we can help dig deeper into the issue.
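One way to produce a small repro (assuming the model can be shared) is to cut out the subgraph around the failing region with `onnx.utils.extract_model`. The tensor names below are placeholders: the TRT node name `/Concat_17slice` first has to be mapped back to nearby ONNX tensor names, e.g. by inspecting the graph in Netron.

```python
import os

try:
    import onnx.utils  # requires the onnx package
except ImportError:
    onnx = None

# Placeholder tensor names: pick the graph tensors that bracket the
# region TRT complains about (/Concat_17slice).
input_names = ["input_tensor_before_concat"]
output_names = ["output_tensor_after_slice"]

if onnx is not None and os.path.exists("model-int8.onnx"):
    onnx.utils.extract_model(
        "model-int8.onnx",      # quantized model that fails to build
        "repro-subgraph.onnx",  # small model to attach to the issue
        input_names,
        output_names,
    )
```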