NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Marginal Improvement Between INT8 and FP16 #2843

Open alexriggio opened 1 year ago

alexriggio commented 1 year ago

I have INT8 quantized a BERT model for binary text classification and am only getting a marginal improvement in speed over FP16.

I am using the transformer-deploy library that utilizes TensorRT.

Tested on both an A4000 and A100 GPU.

A4000 --> TensorRT INT8: 34.48 ms, TensorRT FP16: 38.72 ms
A100 ---> TensorRT INT8: 11.53 ms, TensorRT FP16: 11.75 ms

These are the quantizers that were disabled to improve accuracy (a rough sketch of how such quantizer modules can be disabled follows the list):

disable bert.encoder.layer.1.intermediate.dense._input_quantizer
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_0
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_1
disable bert.encoder.layer.2.output.layernorm_quantizer_0
disable bert.encoder.layer.2.output.layernorm_quantizer_1
disable bert.encoder.layer.3.attention.output.dense._input_quantizer
disable bert.encoder.layer.10.attention.self.key._input_quantizer
disable bert.encoder.layer.11.attention.output.dense._input_quantizer
disable bert.encoder.layer.11.output.dense._input_quantizer
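
For reference, disabling specific quantizers by name is typically done through the TensorQuantizer modules from pytorch-quantization. The sketch below is illustrative only: model is a placeholder for the QAT BERT model, and the name set is a subset of the list above, not code from this setup.

from pytorch_quantization.nn import TensorQuantizer

# Names taken from the "disable ..." list above; extend as needed.
layers_to_disable = {
    "bert.encoder.layer.1.intermediate.dense._input_quantizer",
    "bert.encoder.layer.3.attention.output.dense._input_quantizer",
    "bert.encoder.layer.11.output.dense._input_quantizer",
}

for name, module in model.named_modules():  # `model` is the QAT BERT model (placeholder)
    if isinstance(module, TensorQuantizer) and name in layers_to_disable:
        module.disable()  # skip fake quantization for this tensor
        print(f"disable {name}")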

The debug logs from the A4000 run are attached here:

trt_logs_int8_quantization.txt

I also tried using the Profiler during inference of one sample in case it provided any useful information, but it seemed to print out only one layer. I assume it should print info on many layers?

# Attach the built-in trt.Profiler so per-layer timings are printed after inference.
profiler = trt.Profiler()
context.profiler = profiler

# Create the execution context and bind it to the optimization profile on the current CUDA stream.
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(
    profile_index=profile_index, stream_handle=torch.cuda.current_stream().cuda_stream
)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]

# Single tokenized sample for the profiled run.
data = train_tokenized[0:1]
input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")
input_np: OD[str, np.ndarray] = convert_tensor(data=data, output="np")

# Run one inference; the attached profiler reports each engine layer it executes.
tensorrt_output = infer_tensorrt(
    context=context,
    inputs=input_torch,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
)

-----------------------------------------------------------------------------------------

[HostToDeviceCopy]: 0.018464ms
{ForeignNode[bert.embeddings.position_embeddings.weight...(Unnamed Layer* 2639) [ElementWise]]}: 1.73869ms
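
(The output above shows only two entries, likely because the bulk of the network is compiled into a single Myelin ForeignNode. As a generic TensorRT Python pattern, per-layer times can also be collected into a dict by subclassing trt.IProfiler; the sketch below is illustrative only and assumes the same context and infer_tensorrt call as above.)

import tensorrt as trt

class LayerTimeProfiler(trt.IProfiler):
    """Accumulates per-layer execution times instead of printing them."""

    def __init__(self):
        trt.IProfiler.__init__(self)  # base class must be initialized explicitly
        self.layer_times = {}

    def report_layer_time(self, layer_name: str, ms: float) -> None:
        # Called by TensorRT once per executed engine layer.
        self.layer_times[layer_name] = self.layer_times.get(layer_name, 0.0) + ms

profiler = LayerTimeProfiler()
context.profiler = profiler
# ... run infer_tensorrt(...) as above, then:
for layer_name, ms in sorted(profiler.layer_times.items(), key=lambda kv: -kv[1]):
    print(f"{layer_name}: {ms:.3f} ms")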

Any insight into these results is greatly appreciated. Thank you.

Versions:
Python: 3.10.9
transformer-deploy: 0.5.4
TensorRT: 8.4.1.5
ONNX Runtime (GPU): 1.12.0
CUDA: 11.7

zerollzeng commented 1 year ago

Can you try our latest 8.6 EA? I think the reason might be that many layers are still running in FP16.

[03/30/2023-23:30:17] [TRT] [V] Engine Layer Information:
Layer(ShapeHostToDevice): [HostToDeviceCopy], Tactic: 0x0000000000000000,  -> token_type_ids[implicit padding 0][Int32()]
Layer(Myelin): {ForeignNode[bert.embeddings.position_embeddings.weight...(Unnamed Layer* 2639) [ElementWise]]}, Tactic: 0x0000000000000000, token_type_ids[Int32(-6,256)], input_ids[Int32(-6,256)], attention_mask[Int32(-6,256)], token_type_ids[implicit padding 0][Int32()] -> output1[Float(-6,2)]
[03/30/2023-23:30:17] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +256, now: CPU 0, GPU 256 (MiB)
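
(As a side note, which precision each layer actually runs in can be checked with the engine inspector in TensorRT >= 8.4. A minimal sketch, assuming engine is the deserialized ICudaEngine and that it was built with profiling verbosity set to DETAILED, otherwise per-layer detail is not recorded:)

import tensorrt as trt

inspector = engine.create_engine_inspector()
# JSON dump of the engine layers; with DETAILED profiling verbosity this
# includes the precision each layer was compiled for.
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))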

@nvpohanh Should know more about it

nvpohanh commented 1 year ago

Could you share the ONNX files as well as the Nsys reports following the instructions here?

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#report-performance-issue
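
(For reference, the capture described in those instructions is a CLI invocation of nsys wrapping trtexec. A rough, hypothetical sketch of driving it from Python with subprocess; the report name, ONNX path, shapes, and the assumption that nsys and trtexec are on PATH are all placeholders:)

import subprocess

cmd = [
    "nsys", "profile", "-o", "trt_int8_report",  # output report name prefix
    "trtexec",                                   # assumes trtexec is on PATH
    "--onnx=model_qat.onnx",                     # the QAT ONNX export
    "--int8", "--fp16",                          # allow both precisions, as in the INT8 build
    "--shapes=input_ids:1x256,attention_mask:1x256,token_type_ids:1x256",
]
subprocess.run(cmd, check=True)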

alexriggio commented 1 year ago

Edit: I have removed the link to the ONNX file since the company would like to keep it private. If you have not already downloaded it, is there an email address or somewhere else I could send it to?

I was having problems installing TensorRT 8.6 EA. I tried standard 8.6 with pip install tensorrt, however that version seems to kill the kernel when I run trt.Runtime().

import tensorrt as trt
from tensorrt import Logger, Runtime

trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
runtime: Runtime = trt.Runtime(trt_logger)

Also, for the nsys reports, I cannot find where /usr/src/tensorrt/tensorrt is located on my system (using Paperspace Gradient) to run nsys profile or trtexec with the CLI command. Is there any way to run it through Python code?

This is the function I call to convert to onnx.

# Export the quantization-aware-trained model to ONNX (Q/DQ nodes included).
data = train_tokenized[1:3]
input_torch = convert_tensor(data, output="torch")
convert_to_onnx(
    model_pytorch=model_q,
    output_path="model_qat.onnx",
    inputs_pytorch=input_torch,
    quantization=True,
    var_output_seq=False,
    output_names=["output1"],
)

# TODO manage encoder / decoder architecture + cache
def convert_to_onnx(
    model_pytorch: torch.nn.Module,
    output_path: str,
    inputs_pytorch: Dict[str, torch.Tensor],
    quantization: bool,
    var_output_seq: bool,
    output_names: List[str],
    load_external_data: bool = False,
) -> None:
    """
    Convert a Pytorch model to an ONNX graph by tracing the provided input inside the Pytorch code.
    Pytorch sometimes fails to infer output tensor shape of models
    In ONNX graph, some axis name may be marked like "Divoutput_dim_1" which is a generated name,
    and there may be a warning:
    ** "WARNING: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference
    for the exported graph. Please consider adding it in symbolic function." **
    ex.: https://discuss.pytorch.org/t/bidirectional-lstm-and-onnx-runtime-warnings/136374
    :param model_pytorch: Pytorch model (transformers)
    :param output_path: where to save ONNX file
    :param inputs_pytorch: Tensor, can be dummy data, shape is not important as we declare all axes as dynamic.
    Should be on the same device as the model (CPU or GPU)
    :param quantization: model is quantized
    :param var_output_seq: variable size sequence
    :param output_names: list of output names in ONNX model
    :param load_external_data: whether to load external data files when re-loading the exported ONNX model
    """
    if quantization:
        try:
            from pytorch_quantization.nn import TensorQuantizer
        except ImportError:
            raise ImportError(
                "It seems that pytorch-quantization is not yet installed. "
                "It is required when you enable the quantization flag and use CUDA device."
                "Please find installation instructions on "
                "https://github.com/NVIDIA/TensorRT/tree/main/tools/pytorch-quantization or use:\n"
                "pip3 install git+ssh://git@github.com/NVIDIA/TensorRT#egg=pytorch-quantization\\&"
                "subdirectory=tools/pytorch-quantization/"
            )

        TensorQuantizer.use_fb_fake_quant = True
    if hasattr(model_pytorch, "config") and hasattr(model_pytorch.config, "use_cache"):
        use_cache = getattr(model_pytorch.config, "use_cache")
        setattr(model_pytorch.config, "use_cache", False)

    # dynamic axis == variable length axis
    dynamic_axis = dict()
    for k in inputs_pytorch.keys():
        if var_output_seq:
            # seq axis name is fixed to be matched with output seq axis name (for output shape prediction)
            dynamic_axis[k] = {0: "batch_size", 1: "sequence"}
        else:
            # if there is no specific requirement, each axis name is unique, fix some issue on T5 model
            dynamic_axis[k] = {0: "batch_size", 1: f"sequence-{k}"}
    for output_name in output_names:
        dynamic_axis[output_name] = {0: "batch_size"}
        if var_output_seq:
            dynamic_axis[output_name][1] = "sequence"
    # replace int64 input tensors by int32 -> for ONNX Runtime binding API and expected by TensorRT engine
    for k, v in inputs_pytorch.items():
        if not isinstance(v, torch.Tensor):
            continue
        if v.dtype in [torch.long, torch.int64]:
            inputs_pytorch[k] = v.type(torch.int32)
    # get input names in the same order as in the model forward
    model_args = model_pytorch.forward.__code__.co_varnames
    input_names = []
    for arg_name in model_args:
        if arg_name in inputs_pytorch.keys():
            input_names.append(arg_name)
    # sentence transformer model forward is kargs and kwargs
    if len(input_names) == 0:
        input_names = list(inputs_pytorch.keys())
    with torch.no_grad():
        torch.onnx.export(
            model_pytorch,  # model to optimize
            args=tuple(inputs_pytorch.values()),  # tuple of multiple inputs
            f=output_path,  # output path / file object
            opset_version=13,  # the ONNX version to use, >= 13 supports channel quantized model
            do_constant_folding=True,  # simplify model (replace constant expressions)
            input_names=input_names,  # input names
            output_names=output_names,  # output names
            dynamic_axes=dynamic_axis,  # declare dynamic axes for each input / output
            training=TrainingMode.EVAL,  # always put the model in evaluation mode
            verbose=False,
        )
    proto = onnx.load(output_path, load_external_data=load_external_data)
    save_onnx(proto=proto, model_path=output_path)
    if quantization:
        TensorQuantizer.use_fb_fake_quant = False
    if hasattr(model_pytorch, "config") and hasattr(model_pytorch.config, "use_cache"):
        setattr(model_pytorch.config, "use_cache", use_cache)
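
For completeness, the exported ONNX is then built into a TensorRT engine with both INT8 and FP16 allowed. The sketch below uses the generic TensorRT Python API rather than the transformer-deploy build helper; the file names and optimization-profile shapes are assumptions for illustration only (sequence length 256 matches the engine log above).

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model_qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # honor the Q/DQ nodes from the QAT export
config.set_flag(trt.BuilderFlag.FP16)   # let non-quantized layers fall back to FP16
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED  # keep per-layer info for the inspector

# Dynamic batch, fixed sequence length of 256 (assumed shapes).
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask", "token_type_ids"):
    profile.set_shape(name, (1, 256), (8, 256), (32, 256))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("model_qat.plan", "wb") as f:
    f.write(serialized_engine)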