NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Understanding the Underlying Implementation of model_calib #40

Open YixuanSeanZhou opened 4 months ago

YixuanSeanZhou commented 4 months ago

Hi,

I am trying to register a "custom layer" (not a native torch.nn layer, but a custom layer that subclasses nn.Module) with modelopt and quantize it. I have been making minor patches to the modelopt torch quantization code so it can identify the places to insert quantizers (for example, the layer uses layer.kernel to represent layer.weight).
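For reference, the kind of patching I mean looks roughly like the sketch below: the layer keeps its weight in `kernel`, and I wrap the input and the weight with quantizers so calibration can observe them. The `TensorQuantizer` import path and default config are my assumption from the docs, and the names here are illustrative rather than my exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed import path; TensorQuantizer is expected to behave like the
# pytorch-quantization one (fake-quant in forward, calibration support).
from modelopt.torch.quantization.nn import TensorQuantizer


class CustomDense(nn.Module):
    """Custom layer that stores its weight as `kernel` instead of `weight`."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.kernel = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.kernel)
        # Quantizers patched in so calibration sees the input and the weight.
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.input_quantizer(x)
        w = self.weight_quantizer(self.kernel)  # quantize kernel, not .weight
        return F.linear(x, w)
```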

However, when I export the model to ONNX, I fail to build a TRT engine due to:

Error Code 10: Internal Error (Could not find any implementation for node /MatMul_&_cpy.)

When I run the model with polygraphy (python ~/trt_model_opt/bin/polygraphy run saved_model.onnx --onnxrt), I get:

MatmulInteger : b zero point is not valid.
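To see what was actually exported, here is a quick way to dump the B zero point of every MatMulInteger node (a sketch using the standard onnx package; saved_model.onnx is the file I exported). ONNX expects that zero point to be a scalar or a 1-D tensor of B's dtype:

```python
import onnx
from onnx import numpy_helper

model = onnx.load("saved_model.onnx")
inits = {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}

# MatMulInteger inputs are: A, B, a_zero_point (optional), b_zero_point (optional).
for node in model.graph.node:
    if node.op_type == "MatMulInteger" and len(node.input) >= 4:
        zp_name = node.input[3]
        zp = inits.get(zp_name)
        if zp is not None:
            print(node.name, zp_name, zp.dtype, zp.shape, zp)
        else:
            print(node.name, zp_name, "(not a graph initializer)")
```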

I am guessing this is because the tensor quantizer is not properly calibrated. However, model_calib is unfortunately shipped only as a .so file. Could you share the source code for that file, or shed some light on how calibration is performed, so I can adjust the layer and the way quantizers are inserted to get the weights quantized correctly?
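For what it's worth, my mental model of what such calibration typically does (symmetric per-tensor INT8 max calibration) is sketched below in plain PyTorch; this is only my understanding of the general technique, not the actual model_calib implementation:

```python
import torch


def max_calibrate_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 max calibration: amax -> scale -> quantized weight."""
    amax = w.abs().max()      # in activation calibration this is collected over forward passes
    scale = amax / 127.0      # symmetric scheme, so the zero point is 0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return scale, q


# Example: calibrating a weight tensor shaped like the custom layer's kernel
scale, q = max_calibrate_int8(torch.randn(64, 128))
print(scale.item(), q.dtype, int(q.min()), int(q.max()))
```

If the amax / zero-point state is never populated (or is exported with the wrong shape or dtype), I would expect exactly the kind of "b zero point is not valid" rejection above, which is why I suspect the calibration step.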

Thanks!

hchings commented 3 months ago

Hi @YixuanSeanZhou, could you share your ONNX file? Since the failure happens at TRT engine build time, this may be an issue that needs to be reported to TRT.