NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Error when exporting ONNX for FP8 #15

Closed. chuong98 closed this issue 6 months ago.

chuong98 commented 6 months ago

I am testing exporting the ResNet18 model provided by timm, and I use Docker so the result is reproducible.

Both the FP8 and INT8 models were quantized successfully, and FP8 has better accuracy than INT8, but that is of little use if I can't export the quantized model to ONNX or TensorRT. I can skip the ONNX step if there is a way to export to TRT directly. Thank you.
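
For context, a minimal sketch of the kind of setup described above, assuming the standard modelopt.torch.quantization post-training quantization API and a timm ResNet18; the model name, input shape, and calibration loop are placeholders, since the original script is not shown:

import timm
import torch
import modelopt.torch.quantization as mtq

# Placeholder model and calibration data; substitute the real pipeline here.
model = timm.create_model("resnet18", pretrained=True).cuda().eval()

def forward_loop(m):
    # Calibration pass so the quantizers can collect amax statistics.
    for _ in range(8):
        m(torch.randn(8, 3, 224, 224, device="cuda"))

# Post-training quantization; pick one config per model instance.
model_quant = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
# model_quant = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)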

chuong98 commented 6 months ago

After using the generate_fp8_scales function from the diffusers example and exporting with opset 17, it works. For anyone who hits the same error:

import torch

def generate_fp8_scales(unet):
    # Temporary workaround for a known bug in torch.onnx._dynamo_export:
    # re-express the FP8 (E4M3, max value 448) quantizer ranges as INT8
    # (max value 127) ranges so the exporter emits standard Q/DQ nodes.
    for _, module in unet.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            module.input_quantizer._num_bits = 8
            module.weight_quantizer._num_bits = 8
            module.input_quantizer._amax = (module.input_quantizer._amax * 127) / 448.0
            module.weight_quantizer._amax = (module.weight_quantizer._amax * 127) / 448.0

# Rescale only when the model was quantized in FP8 mode.
if args.quant_mode == 'fp8':          # args comes from the surrounding script
    generate_fp8_scales(model_quant)
torch_to_onnx(model_quant, input)     # export helper, see the sketch below
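
Note that torch_to_onnx above is the commenter's own helper and is not part of Model Optimizer. A minimal, hypothetical stand-in, assuming it is a plain torch.onnx.export call pinned to the opset 17 reported to work:

import torch

def torch_to_onnx(model, dummy_input, path="model_fp8.onnx"):
    # Hypothetical export helper: a plain ONNX export at opset 17.
    model.eval()
    with torch.no_grad():
        torch.onnx.export(
            model,
            dummy_input,
            path,
            opset_version=17,
            input_names=["input"],
            output_names=["output"],
        )

The resulting ONNX file, now carrying INT8-style Q/DQ nodes with the rescaled ranges, can then be passed to trtexec or the TensorRT builder to produce an engine.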