NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Error when exporting a TRT engine from the quantized ONNX #16

Open chuong98 opened 6 months ago

chuong98 commented 6 months ago

After successfully quantizing and exporting ONNX models for ResNet18 in two different modes (int8 and fp8), I am trying to build TRT engines from these ONNX models, but with no luck so far. It fails with errors about unsupported layers:

[05/22/2024-03:49:37] [E] Error[9]: Skipping tactic 0xdb2ceb83bdb264c9 due to exception Assertion dims1.d[i] == 1 || dims2.d[i] == 1 || dims1.d[i] == dims2.d[i] failed. [64] cannot broadcast with [224]
[05/22/2024-03:49:37] [E] Error[9]: Skipping tactic 0xc37005486323d39b due to exception Assertion dims1.d[i] == 1 || dims2.d[i] == 1 || dims1.d[i] == dims2.d[i] failed. [64] cannot broadcast with [224]
[05/22/2024-03:49:37] [W] [TRT] Engine generation failed with backend strategy 2.
Error message: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node conv1.weight + /conv1/weight_quantizer/QuantizeLinear + /conv1/Conv.).
Skipping this backend strategy.
[05/22/2024-03:49:37] [E] Error[2]: [standardEngineBuilder.cpp::makeEngineFromSubGraph::1545] Error Code 2: Internal Error (Engine generation failed because all backend strategies failed.)
[05/22/2024-03:49:37] [E] Engine could not be created from network
[05/22/2024-03:49:37] [E] Building engine failed
[05/22/2024-03:49:37] [E] Failed to create engine from model or file.
[05/22/2024-03:49:37] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v9300] # trtexec --onnx=cache/int8/resnet18_quantized_int8.onnx --saveEngine=cache/int8/resnet18_quantized_int8.trt --fp16 --builderOptimizationLevel=4 --int8 --staticPlugins=/workspace/examples/plugins/bin/FP8Conv2DPlugin.so --staticPlugins=/workspace/examples/plugins/bin/groupNormPlugin.so --workspace=48000 --timingCacheFile=timing.cache
chuong98 commented 6 months ago

Here is my command to export the TRT engine:

fp8_plugin="/workspace/examples/plugins/bin/FP8Conv2DPlugin.so"
groupNorm_plugin="/workspace/examples/plugins/bin/groupNormPlugin.so"

# --minShapes=input:1x3x224x224 \
# --optShapes=input:16x3x224x224 \
# --maxShapes=input:32x3x224x224 \
# --shapes=input:1x3x224x224 \
export_params=$(cat <<-END
    --onnx=$weight_path \
    --saveEngine=$weight_dir/$new_filename \
    --fp16 --builderOptimizationLevel=4 \
    --int8 --staticPlugins=$fp8_plugin --staticPlugins=$groupNorm_plugin \
    --workspace=48000 --timingCacheFile=timing.cache
END
)

echo "Exporting TRT model ${new_filename}..."
trtexec $export_params

The onnx models can be downloaded from the Google Drive: https://drive.google.com/drive/folders/1kQEVN7FjLXD3XPF2Qrlr6gqNeiDVPup4?usp=sharing

riyadshairi979 commented 6 months ago

After successfully quantizing and exporting ONNX models for ResNet18 in two different modes (int8 and fp8)

What command did you use to quantize the ONNX model?

chuong98 commented 6 months ago

@riyadshairi979 I have attached the Python script used to quantize the model, following the documentation: quantize_model.txt (please rename the file from .txt to .py; GitHub does not allow uploading *.py files).

Essentially, this is the code snippet used to quantize:

import torch
from tqdm import tqdm
import modelopt.torch.quantization as mtq
from timm.data import create_dataset, create_loader  # assuming timm's data utilities

# Use 512 ImageNet validation images as the calibration set
dataset_quant = create_dataset('imagenet', root='/data/imagenet', split='val', num_samples=512)
data_loader_quant = create_loader(dataset_quant,
                                  input_size=(3, 224, 224),
                                  batch_size=args.batch_size,
                                  is_training=False,
                                  use_prefetcher=False)

# Define forward_loop. Please wrap the data loader in the forward_loop
def forward_loop(model):
    for batch in tqdm(data_loader_quant):
        input, target = batch
        input = input.to(device)
        model(input)

# config is one of the mtq quantization configs, e.g. mtq.INT8_SMOOTHQUANT_CFG or mtq.INT8_DEFAULT_CFG
model_quant = mtq.quantize(model, config, forward_loop)
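
The ONNX export step that follows this is in the attached quantize_model.txt; a minimal sketch of it, assuming a plain torch.onnx.export of the calibrated model, a static 224x224 input, and a hypothetical output file name:

import torch

# Sketch only: the attached quantize_model.txt is the authoritative script.
# modelopt's quantizers are regular torch modules, so the standard ONNX export
# path emits QuantizeLinear/DequantizeLinear nodes for them.
dummy_input = torch.randn(1, 3, 224, 224, device=device)
torch.onnx.export(
    model_quant,
    dummy_input,
    "resnet18_quantized_int8.onnx",   # hypothetical output path
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
)
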
chuong98 commented 6 months ago

It turns out that ResNet has Conv and BN layers, and the quantizer inserts Quantize and DeQuantize nodes between the Conv and the BN, which leads to the failure. I got it to work by manually fusing the Conv and BN layers before quantization.
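
For illustration, that kind of Conv-BN fusion can be done with torch.ao.quantization.fuse_modules before calling mtq.quantize; a sketch for torchvision's resnet18 layout (an illustrative approach, not necessarily the exact one used in this issue):

import torch
from torchvision.models import resnet18
from torch.ao.quantization import fuse_modules

model = resnet18().eval()  # load trained weights as needed; Conv-BN fusion requires eval mode

# Collect (conv, bn) name pairs: the stem conv1/bn1, each BasicBlock's
# conv1/bn1 and conv2/bn2, and the downsample Sequential's [0]/[1].
pairs = [["conv1", "bn1"]]
for name, module in model.named_modules():
    if name.endswith(".conv1"):
        pairs.append([name, name.replace("conv1", "bn1")])
    elif name.endswith(".conv2"):
        pairs.append([name, name.replace("conv2", "bn2")])
    elif name.endswith(".downsample"):
        pairs.append([name + ".0", name + ".1"])

model = fuse_modules(model, pairs)  # each Conv+BN pair is folded into a single Conv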

realAsma commented 6 months ago

It turns out that ResNet has Conv and BN layers, and the quantizer inserts Quantize and DeQuantize nodes between the Conv and the BN, which leads to the failure. I got it to work by manually fusing the Conv and BN layers before quantization.

@chuong98 I infer that you used INT8_SMOOTHQUANT_CFG. This config is for INT8 quantization with the SmoothQuant algorithm and is intended for models with only nn.Linear layers.

For quantizing CNN models to INT8, we need to use INT8_DEFAULT_CFG. Please see more about the quantization formats here.

If you are not happy with your current solution, would it be possible for you to try out INT8_DEFAULT_CFG and see if it works?
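
For reference, that would only change the config argument in the earlier snippet (assuming the same model and forward_loop):

import modelopt.torch.quantization as mtq

# Same calibration loop as before; only the quantization config changes.
model_quant = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)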

Thanks!!

chuong98 commented 6 months ago

@realAsma Thanks for your help. Yes, I missed the note: "The recommended configs given below are for LLM models. For CNN models, only INT8 quantization is supported. Please use quantization config INT8_DEFAULT_CFG for CNN models."

Changing the config to INT8_DEFAULT_CFG works without manually fusing Conv-BN. Great. Then, comparing INT8_DEFAULT_CFG to the trtexec --onnx=$MODEL --int8 option that we commonly use, what is the difference? Are they the same? Thank you.

chuong98 commented 6 months ago

My experiment with the TensorRT Model Optimizer yields worse results than the default TensorRT trtexec: [image: results table]

jingyu-ml commented 6 months ago

I noticed you used generate_fp8_scale for the fp8 ResNet. This function is necessary because torch.onnx.export currently doesn't support FP8 export for Conv layers. To work around this, we borrowed the int8 export logic. Then, at this link, we manually change all the int8 QDQ operations to FP8 in the ONNX graph.

We will soon release out-of-the-box FP8 support in TensorRT. Once it's available, you can use generate_fp8_scale to convert the model to a TensorRT engine without needing a graph surgeon. Additionally, once we verify that torch.dynamo_export works for FP8 Conv layers, we can try exporting without generate_fp8_scale.
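
For illustration, turning int8 Q/DQ nodes into FP8 ones in an ONNX graph essentially means replacing their zero-point initializers with FLOAT8E4M3FN tensors (opset 19+); a rough sketch with the onnx package, not the actual code behind the link above:

import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

model = onnx.load("resnet18_quantized_fp8.onnx")  # hypothetical input file
graph = model.graph
initializers = {init.name: init for init in graph.initializer}
converted = set()

for node in graph.node:
    if node.op_type not in ("QuantizeLinear", "DequantizeLinear"):
        continue
    if len(node.input) < 3 or node.input[2] not in initializers:
        continue  # no explicit zero-point initializer to rewrite
    zp_name = node.input[2]
    if zp_name in converted:
        continue  # shared zero point already rewritten
    old_zp = initializers[zp_name]
    shape = list(numpy_helper.to_array(old_zp).shape)
    # An all-zero FLOAT8E4M3FN zero point marks this Q/DQ pair as FP8.
    new_zp = helper.make_tensor(
        name=zp_name,
        data_type=TensorProto.FLOAT8E4M3FN,
        dims=shape,
        vals=np.zeros(shape, dtype=np.uint8).tobytes(),
        raw=True,
    )
    graph.initializer.remove(old_zp)
    graph.initializer.append(new_zp)
    converted.add(zp_name)

onnx.save(model, "resnet18_quantized_fp8_e4m3.onnx")  # remember to bump the opset import to >= 19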

chuong98 commented 6 months ago

Can you provide an example of quantizing a ViT model, for example Swin?

realAsma commented 5 months ago

@chuong98 My apologies for the performance degradation. The modelopt.torch.quantization speedup analysis has been focused on LLMs (deployed via TensorRT-LLM) and diffusion models (deployed via TRT).

It is quite possible for quantized models other than LLMs or diffusion models to be slower than TRT's un-quantized baselines. In this case, we recommend exporting the PyTorch model to ONNX first and then quantizing the ONNX graph via modelopt.onnx.quantization. Please see the examples for quantizing an ONNX graph here.

Could you please try out quantizing the ONNX graph using modelopt.onnx.quantization instead?
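
For reference, a rough sketch of that flow (the argument names here are assumptions on my part; please follow the linked ONNX PTQ examples for the exact API):

import numpy as np
from modelopt.onnx.quantization import quantize

# Hypothetical calibration file: a numpy array of preprocessed inputs with the
# same shape as the model input, e.g. (N, 3, 224, 224).
calib_data = np.load("calib_imagenet_224.npy")

quantize(
    onnx_path="resnet18.onnx",          # the un-quantized ONNX export
    calibration_data=calib_data,
    quantize_mode="int8",
    output_path="resnet18.quant.onnx",
)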

korkland commented 4 months ago

My experiment with the TensorRT Model Optimizer yields worse results than the default TensorRT trtexec: [image: results table]

I'm experiencing the same degradation with int8 vs. fp16 in RetinaNet, using the modelopt PyTorch API. In your table, you compared fused BN vs. non-fused BN. Did you use the smooth config there or the int8 default config?

Have you found a better approach for quantization that gives you better latency with more control over the quantization process (not the implicit one, which, according to the table, seems to bring significant improvement in latency)?