NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

[Hugging Face transformer models + pytorch_quantization] PTQ quantization int8 is slower than fp16 #1604

Closed: pommedeterresautee closed this issue 2 years ago

pommedeterresautee commented 2 years ago

Description

When using pytorch_quantization with Hugging Face models, INT8 is always slower than FP16, regardless of sequence length, batch size, or model. The TensorRT engines are built with trtexec (see below).

Many QDQ nodes sit just before a Transpose node followed by a MatMul. I am under the impression this may be a source of the performance issue (https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs).

According to https://github.com/NVIDIA/sampleQAT/blob/master/postprocess_onnx.py:

    """
    This is a workaround to manually transpose the conv weights and remove
    the existing transpose nodes. Currently TRT has a limitation when there is
    a transpose node as an input to the weights of the conv layer. This utility 
    would be removed in future releases.
    """

This may be linked to https://github.com/NVIDIA/TensorRT/issues/1532

Second point: it doesn't seem that the bert module (https://github.com/NVIDIA/TensorRT/blob/main/tools/pytorch-quantization/pytorch_quantization/nn/modules/quant_bert.py) is enabled in the default quantization map (https://github.com/NVIDIA/TensorRT/blob/main/tools/pytorch-quantization/pytorch_quantization/quant_modules.py#L26).
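
As a side note (not from the original report), here is a rough sketch of how one might check what quant_modules.initialize() actually patches in this model; since quant_bert.py is not in the default replacement map, only the standard torch.nn layers get swapped, not the attention matmuls:

# Rough sketch, not part of the original notebook: count which layers are
# monkey-patched to their quantized counterparts after quant_modules.initialize().
from transformers import AutoModelForSequenceClassification
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_modules.initialize()
check_model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

n_quant_linear = sum(isinstance(m, quant_nn.QuantLinear) for m in check_model.modules())
n_quantizers = sum(isinstance(m, quant_nn.TensorQuantizer) for m in check_model.modules())
print(f"QuantLinear layers: {n_quant_linear}, TensorQuantizer nodes: {n_quantizers}")
# The attention matmuls (softmax(QK^T)V) get no TensorQuantizer here because the
# BERT-specific quantized module is not enabled in the default map.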


# INT8 quantized model
[11/09/2021-11:51:11] [I] === Performance summary ===
[11/09/2021-11:51:11] [I] Throughput: 61.2925 qps
[11/09/2021-11:51:11] [I] Latency: min = 14.9854 ms, max = 26.0117 ms, mean = 16.2563 ms, median = 15.2119 ms, percentile(99%) = 22.6989 ms
[11/09/2021-11:51:11] [I] End-to-End Host Latency: min = 29.5244 ms, max = 44.1949 ms, mean = 32.2826 ms, median = 30.2827 ms, percentile(99%) = 43.3751 ms

# FP16 model - no QDQ nodes
[11/09/2021-11:52:29] [I] === Performance summary ===
[11/09/2021-11:52:29] [I] Throughput: 100.687 qps
[11/09/2021-11:52:29] [I] Latency: min = 9.50928 ms, max = 15.5975 ms, mean = 9.93139 ms, median = 9.64233 ms, percentile(99%) = 13.3743 ms
[11/09/2021-11:52:29] [I] End-to-End Host Latency: min = 18.1421 ms, max = 26.309 ms, mean = 19.6506 ms, median = 19.1113 ms, percentile(99%) = 24.8865 ms

Netron screenshot of the INT8 quantized model: [image]

Environment

TensorRT Version: 8.2 (preview)
NVIDIA GPU: RTX 3090
NVIDIA Driver Version: 495.29.05
CUDA Version: 11.5
CUDNN Version: 8.3.0.98
Operating System: Linux Ubuntu 21.04
Python Version (if applicable): 3.9
PyTorch Version (if applicable): 1.10
Baremetal or Container (if so, version): Baremetal

Relevant Files

The ONNX file is too big to be attached. It can be reproduced with the script below.

Steps To Reproduce

To recreate both the non-quantized model and the quantized artifacts (requires Hugging Face transformers and pytorch_quantization), run the notebook below (the two trtexec commands are at the very end).

Based on https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb

#%%

from transformers import (
    AutoModelForSequenceClassification,
    PreTrainedModel,
)

import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from typing import Dict, List, Tuple

#%%

def convert_to_onnx(model_pytorch: PreTrainedModel, output_path: str, inputs_pytorch: Dict[str, torch.Tensor]) -> None:
    with torch.no_grad():
        torch.onnx.export(
            model_pytorch,  # model to optimize
            args=(inputs_pytorch["input_ids"], inputs_pytorch["attention_mask"]),  # tuple of multiple inputs
            f=output_path,  # output path / file object
            opset_version=13,  # the ONNX version to use
            do_constant_folding=True,  # simplify model (replace constant expressions)
            input_names=["input_ids", "attention_mask"],  # input names
            output_names=["model_output"],  # output name
            dynamic_axes={  # declare dynamic axes for each input / output (dynamic axis == variable-length axis)
                "input_ids": {0: "batch_size", 1: "sequence"},
                "attention_mask": {0: "batch_size", 1: "sequence"},
                "model_output": {0: "batch_size"},
            },
            verbose=False,
        )

def prepare_input(seq_len: int, batch_size: int) -> Tuple[Dict[str, torch.Tensor], Dict[str, np.ndarray]]:
    shape = (batch_size, seq_len)
    input_ids = torch.randint(high=100, size=shape, dtype=torch.long, device="cuda")
    attention_mask = torch.ones(size=shape, dtype=torch.long, device="cuda")
    inputs_pytorch: Dict[str, torch.Tensor] = {"input_ids": input_ids, "attention_mask": attention_mask}
    inputs_onnx: Dict[str, np.ndarray] = {
        k: np.ascontiguousarray(v.detach().cpu().numpy()) for k, v in inputs_pytorch.items()
    }
    return inputs_pytorch, inputs_onnx

huggingface_hub_path = "cross-encoder/ms-marco-MiniLM-L-6-v2"
seq_len = 128
batch_size = 8
input_torch, input_numpy = prepare_input(seq_len, 1)
print(input_torch)

#%%

non_q_model_pytorch: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(huggingface_hub_path)
non_q_model_pytorch.cuda()
non_q_model_pytorch.eval()
convert_to_onnx(model_pytorch=non_q_model_pytorch, output_path="./not_quantized.onnx", inputs_pytorch=input_torch)

#%%

from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor

#%%

quant_desc_input = QuantDescriptor(calib_method='histogram')
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

quant_modules.initialize()

#%%

model_pytorch: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(huggingface_hub_path)
assert torch.cuda.is_available()
model_pytorch.cuda()
model_pytorch.eval()

#%%

def collect_stats(model, data_loader, num_batches):
    """Feed data to the network and collect statistic"""

    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    for i, data in tqdm(enumerate(data_loader), total=num_batches):
        print(data)
        model(**data)
        if i >= num_batches:
            break

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
#             print(F"{name:40}: {module}")
    model.cuda()

#%%

class CustomTextDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 20

    def __getitem__(self, idx):
        return {k: v.squeeze() for k, v in input_torch.items()}

data_loader = DataLoader(CustomTextDataset(), batch_size=batch_size)
print(next(enumerate(data_loader)))

#%%

_, b = list(tqdm(enumerate(data_loader), total=1))[0]
print(b)
print(model_pytorch(**b))

#%%

with torch.no_grad():
    collect_stats(model_pytorch, data_loader, num_batches=2)
    compute_amax(model_pytorch, method="percentile", percentile=99.99)

#%%

print(model_pytorch(**b))

#%%

quant_nn.TensorQuantizer.use_fb_fake_quant = True

# fix in tensor_quantizer.py
# inputs, scale.data, torch.zeros_like(scale, dtype=torch.int32).data, quant_dim,

convert_to_onnx(model_pytorch=model_pytorch, output_path="./quantization.onnx", inputs_pytorch=b)

#%%

!/usr/src/tensorrt/bin/trtexec --onnx=quantization.onnx --best --shapes=input_ids:32x384,attention_mask:32x384 --workspace=9000 --verbose --dumpProfile --separateProfileRun  # --saveEngine=engine.trt  --exportTimes=timings.json

#%%

!/usr/src/tensorrt/bin/trtexec --onnx=./not_quantized.onnx --best --shapes=input_ids:32x384,attention_mask:32x384 --workspace=9000 --verbose --dumpProfile --separateProfileRun

#%%
ttyio commented 2 years ago

Hello @pommedeterresautee ,

The Transpose between the initializer + Q/DQ and the MatMul will not hurt perf in TRT; it is handled during the engine build stage.

Regarding quant_bert.py: it is no longer used and we will remove it; the functionality is already upstream in https://huggingface.co/docs/transformers/model_doc/qdqbert

For INT8 being slower than FP16, I have created an internal bug to track this, thanks!

pommedeterresautee commented 2 years ago

Thank you, indeed I have tried the new QDQBert model and it works as expected (2x faster than FP16 on an RTX 3090).
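
For reference, a rough sketch (my assumption of what that switch looks like, based on the QDQBERT documentation linked above, not the author's exact code) of loading the same checkpoint into the upstream QDQBERT architecture, which also quantizes the attention matmuls:

from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
from transformers import QDQBertForSequenceClassification

# Same histogram calibrator as in the notebook above; must be set before model creation
quant_desc_input = QuantDescriptor(calib_method="histogram")
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

# QDQBERT can be initialized from a regular BERT checkpoint (per the linked docs)
qdq_model = QDQBertForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
qdq_model.cuda().eval()

# Calibration and ONNX export then follow the same collect_stats / compute_amax /
# convert_to_onnx steps as in the notebook above.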