NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Why does TensorRT use Convolution instead of MatMul in an explicitly quantized model? #3266

Open · WoodieDudy opened this issue 1 year ago

WoodieDudy commented 1 year ago

I have this model:

from torch import nn

class MLP(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        d_model, d_ff = 512, 2048
        self.lin1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()
        self.lin2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.lin1(x)
        x = self.activation(x)
        x = self.lin2(x)
        return x

I exported it to ONNX using explicit quantization with pytorch_quantization.

import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

# Export the fake-quant nodes as ONNX QuantizeLinear/DequantizeLinear ops.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
# Monkey-patch torch.nn layers with their quantized counterparts
# (nn.Linear -> quant_nn.QuantLinear) before the model is constructed.
quant_modules.initialize()
model = MLP().eval()

torch.onnx.export(
    model.cuda(),
    torch.rand(10240, 512).cuda(),
    "MLP_explicit_quant_fp32.onnx",
    verbose=False,
    input_names=["x"],
    opset_version=17
)

MLP_explicit_quant_fp32.onnx.zip

To build and visualize the model:

python TensorRT/tools/experimental/trt-engine-explorer/utils/process_engine.py MLP_explicit_quant_fp32.onnx temp int8 fp16

Output with the trtexec commands used to build the same engine:

Building the engine:
trtexec --verbose --nvtxMode=verbose --buildOnly --workspace=8192 --onnx=onnx/MLP_explicit_quant_fp32.onnx --saveEngine=temp/MLP_explicit_quant_fp32.onnx.engine --timingCacheFile=./timing.cache --int8 --fp16

Successfully built the engine.

Engine building metadata: generated output file temp/MLP_explicit_quant_fp32.onnx.engine.build.metadata.json
Profiling the engine:
trtexec --verbose --noDataTransfers --useCudaGraph --separateProfileRun --useSpinWait --nvtxMode=verbose --loadEngine=temp/MLP_explicit_quant_fp32.onnx.engine --exportTimes=temp/MLP_explicit_quant_fp32.onnx.engine.timing.json --exportProfile=temp/MLP_explicit_quant_fp32.onnx.engine.profile.json --exportLayerInfo=temp/MLP_explicit_quant_fp32.onnx.engine.graph.json --timingCacheFile=./timing.cache --int8 --fp16
WARNING:root:Could not lock clocks (Insufficient Permissions).
    Try running as root or locking the clocks from the commandline:
        sudo nvidia-smi --lock-gpu-clocks=1410,1410
        sudo nvidia-smi --applications-clocks=1215,1410
WARNING:root:Could not unlock clocks (Insufficient Permissions).
    Try running as root or unlocking the clocks from the commandline:
        sudo nvidia-smi --reset-gpu-clocks
        sudo nvidia-smi --reset-applications-clocks

Successfully profiled the engine.

Profiling metadata: generated output file temp/MLP_explicit_quant_fp32.onnx.engine.profile.metadata.json
Generating graph diagram: temp/MLP_explicit_quant_fp32.onnx.engine.graph.json
/root/projects/TensorRT/tools/experimental/trt-engine-explorer/trex/engine_plan.py:90: UserWarning:

Profiling data was not provided.

Created file:///root/projects/TensorRT/tools/experimental/trt-engine-explorer/temp/MLP_explicit_quant_fp32.onnx.engine.graph.json.svg
Artifcats directory: temp

Build logs: MLP_explicit_quant_fp32.onnx.engine.build.log

I noticed that TensorRT uses Convolution instead of MatMul, despite the fact that @nvpohanh said here that with explicit quantization Convolution should be replaced by MatMul.

(attached: MLP_explicit_quant_fp32.onnx.engine.graph.json)

nvpohanh commented 1 year ago

By default, TRT uses INT8 Convs for all the INT8 Gemms. In theory, they should have similar performance.

If you really want to use INT8 Gemm kernels, please add a LayerNormalization or an MHA pattern to trigger the Transformer optimization in TRT.
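
For reference, a minimal sketch of what adding a LayerNormalization to the MLP could look like (the placement after the second Linear and the class name are my own assumptions; the point is only to give TRT a pattern that can trigger the Transformer optimization):

from torch import nn

class MLPWithLayerNorm(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        d_model, d_ff = 512, 2048
        self.lin1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()
        self.lin2 = nn.Linear(d_ff, d_model)
        # nn.LayerNorm exports as an ONNX LayerNormalization op at opset >= 17.
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.lin1(x)
        x = self.activation(x)
        x = self.lin2(x)
        return self.norm(x)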

WoodieDudy commented 1 year ago

I added a LayerNorm and the block merged into a Myelin region. (attached: MLP_explicit_quant_fp32_layernorm.onnx.engine.graph.json)

nvpohanh commented 1 year ago

Yes, that is expected. So now if you enable explicit quantization (by adding Q/DQ ops before the gemm), you should see INT8 gemm kernels being used.

WoodieDudy commented 1 year ago

I already have Q/DQ added, as you can see in the first SVG picture. Does this mean that the calculations inside Myelin are done in INT8? Maybe I can check it using the nsys profiler? The original ONNX looks like this:

(image: original ONNX graph)

nvpohanh commented 1 year ago

I think we will need to check the Nsys profile to know exactly which kernels are used.
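
For example (the report name is arbitrary, and this assumes nsys is on PATH and the engine from the build step above already exists), wrapping the trtexec profiling run with Nsight Systems would look something like:

nsys profile -o mlp_int8_report trtexec --loadEngine=temp/MLP_explicit_quant_fp32.onnx.engine --int8 --fp16 --noDataTransfers --useSpinWait

The kernel names in the resulting timeline show whether the gemms actually run with INT8 inputs or fall back to FP16/FP32.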

WoodieDudy commented 1 year ago

I profiled it. So it does use INT8 inside 👍 (screenshot: SCR-20230904-oozi)

nvpohanh commented 1 year ago

That kernel (igemm_int8) is an INT8-input, FP32-output gemm kernel and it does not use Tensor Cores. Could you add an additional Q/DQ after the gemm so that it uses an INT8 output, to achieve potentially better perf? Thanks

vadimkantorov commented 1 year ago

@nvpohanh Should it also do the right thing if we set inputIOFormats + outputIOFormats to fp16?
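
For context, I mean trtexec's --inputIOFormats / --outputIOFormats flags, along the lines of (the exact format spec here is a sketch, not taken from the thread):

trtexec --onnx=onnx/MLP_explicit_quant_fp32.onnx --saveEngine=temp/MLP_explicit_quant_fp32.onnx.engine --int8 --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw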

nvpohanh commented 1 year ago

I don't think IO formats matter, though. You just need to add Q/DQ after the last gemm.

vadimkantorov commented 1 year ago

You just need to add Q/DQ after the last gemm

How can we do that? Replace return x with return TensorQuantizer(quant_nn.QuantLinear.default_quant_desc_input)(x)? Or directly something like return torch.fake_quantize_per_tensor_affine(x, x.amax() / 127, 0, -128, 127)? Or do you have some other way of adding Q/DQ in mind?
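
For illustration, a minimal sketch of the first option (wrapping the output in a TensorQuantizer registered as a submodule; the class and attribute names are my own, and this is just one way to get a Q/DQ pair onto the last gemm's output):

from torch import nn
from pytorch_quantization import nn as quant_nn

class MLPQuantOut(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        d_model, d_ff = 512, 2048
        self.lin1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()
        self.lin2 = nn.Linear(d_ff, d_model)
        # Extra fake-quant node after the last Linear so the exported ONNX
        # carries a Q/DQ pair on the gemm output; registering it as a
        # submodule keeps it visible next to the other TensorQuantizers.
        self.output_quantizer = quant_nn.TensorQuantizer(quant_nn.QuantLinear.default_quant_desc_input)

    def forward(self, x):
        x = self.lin1(x)
        x = self.activation(x)
        x = self.lin2(x)
        return self.output_quantizer(x)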

I'm a bit worried that we would be forcing it to use some unneeded quantization parameters when in fact we just want an fp16 output.

Also, for the faithfulness of the microbenchmark, we would still like to have fp16 inputs and outputs, as that's the real final setup.

Thanks @nvpohanh for your kind help!