Why TensorRT use Convolution instead MatMul in explicit quantized model

WoodieDudy commented 1 year ago

I have model

class MLP(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        d_model, d_ff = 512, 2048
        self.lin1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()
        self.lin2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.lin1(x)
        x = self.activation(x)
        x = self.lin2(x)
        return x

I exported it to ONNX using explicit quantization with pytorch_quantization.

import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_nn.TensorQuantizer.use_fb_fake_quant = True
quant_modules.initialize()
model = MLP().eval()

torch.onnx.export(
    model.cuda(),
    torch.rand(10240, 512).cuda(),
    "MLP_explicit_quant_fp32.onnx",
    verbose=False,
    input_names=["x"],
    opset_version=17
)

MLP_explicit_quant_fp32.onnx.zip

To build and visualize model

python TensorRT/tools/experimental/trt-engine-explorer/utils/process_engine.py MLP_explicit_quant_fp32.onnx temp int8 fp16

Output with trtexec commands to build same engine:

Building the engine:
trtexec --verbose --nvtxMode=verbose --buildOnly --workspace=8192 --onnx=onnx/MLP_explicit_quant_fp32.onnx --saveEngine=temp/MLP_explicit_quant_fp32.onnx.engine --timingCacheFile=./timing.cache --int8 --fp16

Successfully built the engine.

Engine building metadata: generated output file temp/MLP_explicit_quant_fp32.onnx.engine.build.metadata.json
Profiling the engine:
trtexec --verbose --noDataTransfers --useCudaGraph --separateProfileRun --useSpinWait --nvtxMode=verbose --loadEngine=temp/MLP_explicit_quant_fp32.onnx.engine --exportTimes=temp/MLP_explicit_quant_fp32.onnx.engine.timing.json --exportProfile=temp/MLP_explicit_quant_fp32.onnx.engine.profile.json --exportLayerInfo=temp/MLP_explicit_quant_fp32.onnx.engine.graph.json --timingCacheFile=./timing.cache --int8 --fp16
WARNING:root:Could not lock clocks (Insufficient Permissions).
    Try running as root or locking the clocks from the commandline:
        sudo nvidia-smi --lock-gpu-clocks=1410,1410
        sudo nvidia-smi --applications-clocks=1215,1410
WARNING:root:Could not unlock clocks (Insufficient Permissions).
    Try running as root or unlocking the clocks from the commandline:
        sudo nvidia-smi --reset-gpu-clocks
        sudo nvidia-smi --reset-applications-clocks

Successfully profiled the engine.

Profiling metadata: generated output file temp/MLP_explicit_quant_fp32.onnx.engine.profile.metadata.json
Generating graph diagram: temp/MLP_explicit_quant_fp32.onnx.engine.graph.json
/root/projects/TensorRT/tools/experimental/trt-engine-explorer/trex/engine_plan.py:90: UserWarning:

Profiling data was not provided.

Created file:///root/projects/TensorRT/tools/experimental/trt-engine-explorer/temp/MLP_explicit_quant_fp32.onnx.engine.graph.json.svg
Artifcats directory: temp

Build logs: MLP_explicit_quant_fp32.onnx.engine.build.log