chuong98 opened 6 months ago
Here is my command to export the TRT engine:
fp8_plugin="/workspace/examples/plugins/bin/FP8Conv2DPlugin.so"
groupNorm_plugin="/workspace/examples/plugins/bin/groupNormPlugin.so"
# --minShapes=input:1x3x224x224 \
# --optShapes=input:16x3x224x224 \
# --maxShapes=input:32x3x224x224 \
# --shapes=input:1x3x224x224 \
export_params=$(cat <<-END
--onnx=$weight_path \
--saveEngine=$weight_dir/$new_filename \
--fp16 --builderOptimizationLevel=4 \
--int8 --staticPlugins=$fp8_plugin --staticPlugins=$groupNorm_plugin \
--workspace=48000 --timingCacheFile=timing.cache
END
)
echo "Exporting TRT model ${new_filename}..."
trtexec $export_params
The ONNX models can be downloaded from Google Drive: https://drive.google.com/drive/folders/1kQEVN7FjLXD3XPF2Qrlr6gqNeiDVPup4?usp=sharing
After successfully quantizing and exporting ONNX models for ResNet18 using two different modes, int8 and fp8, I am trying to export these ONNX models to TRT, but no luck so far. It returns an "Error: No support layers" message.
What command did you use to quantize the ONNX model?
@riyadshairi979 I attached the Python script used to quantize the model, following the documentation: quantize_model.txt (please rename the file from .txt to .py; GitHub does not allow uploading *.py files).
Essentially, this is the code snippet used to quantize:
from timm.data import create_dataset, create_loader
from tqdm import tqdm
import modelopt.torch.quantization as mtq

# 512 ImageNet validation images are used for calibration; `args` and `device`
# come from the surrounding script.
dataset_quant = create_dataset('imagenet', root='/data/imagenet', split='val', num_samples=512)
data_loader_quant = create_loader(dataset_quant,  # note: the calibration dataset, not dataset_val
                                  input_size=(3, 224, 224),
                                  batch_size=args.batch_size,
                                  is_training=False,
                                  use_prefetcher=False)

# Define forward_loop. Please wrap the data loader in the forward_loop.
def forward_loop(model):
    for batch in tqdm(data_loader_quant):
        input, target = batch
        input = input.to(device)
        model(input)

# `config` is the modelopt quantization config; in this first attempt it was
# INT8_SMOOTHQUANT_CFG (see the replies below).
model_quant = mtq.quantize(model, config, forward_loop)
It turns out that ResNet has Conv and BN layers, and the quantizer inserts Quantize/DeQuantize nodes between Conv and BN, which leads to a failure. I got it to work by:
- fusing the BN and Conv before quantizing (see the sketch below);
- replacing the default quantization config, which doesn't work, with get_int8_config from the Diffusers example: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/diffusers/utils.py
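For anyone hitting the same Conv-BN issue, here is a minimal sketch of the fusion step, assuming a torchvision-style resnet18 in eval mode (if you build the model with timm, the module names will differ). torch.ao.quantization.fuse_modules folds the BN statistics into the preceding Conv before mtq.quantize is called:

```python
import torch
from torch.ao.quantization import fuse_modules
from torchvision.models import resnet18

# Fusion folds BN into the Conv weights, so no Q/DQ nodes can land between them.
model = resnet18(weights="IMAGENET1K_V1").eval()  # fuse_modules requires eval mode

# Fuse the stem conv/bn, then every conv/bn pair inside the residual blocks.
model = fuse_modules(model, [["conv1", "bn1"]])
for name, layer in model.named_children():
    if name.startswith("layer"):
        for block in layer:
            pairs = [["conv1", "bn1"], ["conv2", "bn2"]]
            if block.downsample is not None:
                pairs.append(["downsample.0", "downsample.1"])
            fuse_modules(block, pairs, inplace=True)
```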
@chuong98 I infer that you used INT8_SMOOTHQUANT_CFG. This config is for INT8 quantization with the SmoothQuant algorithm and is intended for models that only contain nn.Linear layers.
For quantizing CNN models to INT8, we need to use INT8_DEFAULT_CFG instead.
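For example, with the calibration loop from your snippet left unchanged, the call would look like this (a minimal sketch):

```python
import modelopt.torch.quantization as mtq

# INT8_DEFAULT_CFG also quantizes Conv layers, so no manual Conv-BN handling
# is needed for CNNs.
model_quant = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```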
Please see more about the quantization formats here.
If you are not happy with your current solution, would it be possible for you to try out INT8_DEFAULT_CFG and see if it works?
Thanks!!
@realAsma Thanks for your help. Yes, I missed the note: "The recommended configs given below are for LLM models. For CNN models, only INT8 quantization is supported. Please use quantization config INT8_DEFAULT_CFG for CNN models."
Changing the config to INT8_DEFAULT_CFG works without manually fusing Conv-BN. Great.
Then, comparing INT8_DEFAULT_CFG to the option trtexec --onnx=$MODEL --int8 that we commonly use, what is the difference? Are they the same? Thank you.
My experiment with the TensorRT Model Optimizer yields worse results than the default TensorRT trtexec:
I noticed you used generate_fp8_scale for the fp8 ResNet. This function is necessary because torch.onnx.export currently doesn't support FP8 export for Conv layers. To work around this, we borrowed the int8 export logic and then, at this link, manually changed all the int8 QDQ operations to FP8 in the ONNX graph.
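To illustrate the idea behind that graph surgery (a hedged sketch only, not the script from the linked code; the file names are hypothetical): in the ONNX QDQ representation, a QuantizeLinear/DequantizeLinear pair becomes FP8 when its zero-point tensor has dtype FLOAT8E4M3FN instead of INT8, so the conversion boils down to rewriting those zero-point initializers:

```python
import onnx
from onnx import TensorProto, helper, numpy_helper

model = onnx.load("resnet18.int8.onnx")  # hypothetical input file
inits = {init.name: init for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear") and len(node.input) == 3:
        zp_name = node.input[2]  # inputs are (x, scale, zero_point)
        old_zp = inits.get(zp_name)
        if old_zp is None or old_zp.data_type != TensorProto.INT8:
            continue
        n_elems = int(numpy_helper.to_array(old_zp).size)
        # Replace the INT8 zero point with an all-zero FP8 (E4M3) tensor of the same shape.
        new_zp = helper.make_tensor(
            name=zp_name,
            data_type=TensorProto.FLOAT8E4M3FN,
            dims=old_zp.dims,
            vals=bytes(n_elems),  # one zero byte per FP8 element
            raw=True,
        )
        model.graph.initializer.remove(old_zp)
        model.graph.initializer.append(new_zp)
        inits[zp_name] = new_zp

onnx.save(model, "resnet18.fp8.onnx")  # hypothetical output file
```

(This needs onnx >= 1.14, where FLOAT8E4M3FN is defined; a real conversion would also bump the model to opset 19 and keep the Q/DQ pairs consistent.)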
We will soon release FP8 support in TensorRT out-of-the-box. Once it's available, you can use generate_fp8_scale to convert the model to a TensorRT engine without needing a graph surgeon. Additionally, once we verify that torch.dynamo_export works for FP8 Conv layers, we can try exporting without generate_fp8_scale.
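For reference, the dynamo-based export path mentioned above looks roughly like this (a sketch with a plain FP32 model; whether it handles FP8 Conv layers is exactly what still needs to be verified):

```python
import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()
example_input = torch.randn(1, 3, 224, 224)

# torch.onnx.dynamo_export (PyTorch >= 2.1) returns an ONNXProgram object.
onnx_program = torch.onnx.dynamo_export(model, example_input)
onnx_program.save("resnet18.dynamo.onnx")
```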
Can you provide an example of quantizing a ViT model, for example Swin?
@chuong98 My apologies for the performance degradation. The modelopt.torch.quantization speedup analysis has been focused on LLMs (deployed via TensorRT-LLM) and diffusion models (deployed via TRT).
It is quite possible for quantized models other than LLMs or diffusion models to be slower than TRT's un-quantized baseline models. In this case, we recommend exporting the PyTorch model to ONNX first and then quantizing the ONNX graph via modelopt.onnx.quantization. Please see the examples for quantizing an ONNX graph here.
Could you please try out quantizing the ONNX graph using modelopt.onnx.quantization instead?
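If it helps, here is a minimal sketch of that ONNX PTQ path (the parameter names follow my reading of the modelopt ONNX quantization examples and may differ in your installed version, so please double-check against the linked examples; the file names and the "input" tensor name are assumptions):

```python
import numpy as np
from modelopt.onnx.quantization import quantize  # assumed import path, see the linked examples

# Representative calibration activations keyed by the ONNX input name.
calibration_data = {"input": np.random.rand(64, 3, 224, 224).astype(np.float32)}

quantize(
    onnx_path="resnet18.onnx",           # FP32/FP16 ONNX exported from PyTorch
    calibration_data=calibration_data,   # use a few hundred real images in practice
    quantize_mode="int8",                # INT8 PTQ of the ONNX graph
    output_path="resnet18.quant.onnx",   # output graph with QDQ nodes
)
```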
My experiment with the TensorRT Model Optimizer yields worse results than the default TensorRT trtexec:
I'm experiencing the same degradation with int8 vs. fp16 in RetinaNet, using the modelopt PyTorch API. In your table, you compared fused BN vs. non-fused BN. Did you use the SmoothQuant config there or the int8 default config?
Have you found a better approach for quantization that gives you better latency with more control over the quantization process (not the implicit one, which, according to the table, seems to bring significant improvement in latency)?
After successfully quantizing and exporting ONNX models for ResNet18 using two different modes, int8 and fp8, I am trying to export these ONNX models to TRT, but no luck so far. It returns an "Error: No support layers" message.