NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Error when exporting ONNX for FP8 #15

Closed. chuong98 closed this issue 6 months ago.

chuong98 commented 6 months ago

I am testing exporting the ResNet18 model provided by timm, and I use Docker so the result is reproducible.

Both the FP8 and INT8 models were quantized successfully, and FP8 has better accuracy than INT8, but that is of little use if I can't export the quantized model to ONNX or TensorRT. I can skip the ONNX step if there is a way to export to TRT directly. Thank you.
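
For context, a minimal sketch of the kind of setup described above, assuming the standard modelopt.torch.quantization post-training quantization API and a timm ResNet18; the model name, input shape, and calibration loop are placeholders, since the original script is not shown:

import timm
import torch
import modelopt.torch.quantization as mtq

# Placeholder model and calibration data; substitute the real pipeline here.
model = timm.create_model("resnet18", pretrained=True).cuda().eval()

def forward_loop(m):
    # Calibration pass so the quantizers can collect amax statistics.
    for _ in range(8):
        m(torch.randn(8, 3, 224, 224, device="cuda"))

# Post-training quantization; pick one config per model instance.
model_quant = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
# model_quant = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)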

chuong98 commented 6 months ago

After using the generate_fp8_scales function from the diffusers example and exporting with opset 17, it works. For anyone who hits the same error:

import torch

def generate_fp8_scales(unet):
    # Temporary workaround for a known bug in torch.onnx._dynamo_export:
    # re-express the FP8 (E4M3, max value 448) quantizer ranges as INT8
    # (max value 127) ranges so the exporter emits standard Q/DQ nodes.
    for _, module in unet.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            module.input_quantizer._num_bits = 8
            module.weight_quantizer._num_bits = 8
            module.input_quantizer._amax = (module.input_quantizer._amax * 127) / 448.0
            module.weight_quantizer._amax = (module.weight_quantizer._amax * 127) / 448.0

# Rescale only when the model was quantized in FP8 mode.
if args.quant_mode == 'fp8':          # args comes from the surrounding script
    generate_fp8_scales(model_quant)
torch_to_onnx(model_quant, input)     # export helper, see the sketch below
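
Note that torch_to_onnx above is the commenter's own helper and is not part of Model Optimizer. A minimal, hypothetical stand-in, assuming it is a plain torch.onnx.export call pinned to the opset 17 reported to work:

import torch

def torch_to_onnx(model, dummy_input, path="model_fp8.onnx"):
    # Hypothetical export helper: a plain ONNX export at opset 17.
    model.eval()
    with torch.no_grad():
        torch.onnx.export(
            model,
            dummy_input,
            path,
            opset_version=17,
            input_names=["input"],
            output_names=["output"],
        )

The resulting ONNX file, now carrying INT8-style Q/DQ nodes with the rescaled ranges, can then be passed to trtexec or the TensorRT builder to produce an engine.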