NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Pointers for TensorRT model with uint8/int8 input #3914

Open · Michelvl92 opened this issue 5 months ago

Michelvl92 commented 5 months ago

By using pytorch-quantization I was able to create TensorRT engine models that are (almost) fully int8 and have lower latencies than equivalent FP16 models.

One of the downsides is that the input is reformatted from FP32 to INT8 for the next 2D conv. This takes up to 5% of the total latency budget of the model. At the same time, I need to cast my images from uint8/int8 to FP32. As you can see, this is not efficient: it introduces double casting and transfers 4x more memory to the GPU for inference, which is not needed since the model is fully int8.

Is it possible to create engines that have (u)int8 input? And do I need to adapt anything during quantization?

lix19937 commented 5 months ago

You can refer to:

network->getInput(0)->setAllowedFormats(static_cast<TensorFormats>(1 << static_cast<int32_t>(TensorFormat::kLINEAR)));

network->getInput(0)->setType(DataType::kINT8);
Michelvl92 commented 5 months ago

@lix19937 thanks for your reply; how would you do this in Python?

lix19937 commented 5 months ago
formats = 1 << int(tensorrt.TensorFormat.LINEAR)
network.get_input(0).allowed_formats = formats
network.get_input(0).dtype = tensorrt.DataType.INT8 
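For context, a minimal sketch (not from this thread) of where those two settings fit when building an engine from a QAT ONNX model with the Python API; the file names are placeholders and the model is assumed to carry Q/DQ nodes exported from pytorch-quantization:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is the default (and deprecated as a flag) in TensorRT 10,
# but passing it keeps the sketch working on older 8.x/9.x releases as well.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("qat_model.onnx", "rb") as f:          # hypothetical QAT ONNX export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)            # scales come from the Q/DQ nodes

# Request an INT8 linear input tensor instead of the default FP32 one.
inp = network.get_input(0)
inp.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)
inp.dtype = trt.DataType.INT8

serialized = builder.build_serialized_network(network, config)
with open("model_int8_input.engine", "wb") as f:  # hypothetical output path
    f.write(serialized)
```

With an INT8 input, the engine then expects data that is already quantized with the scale of the first Q/DQ pair, which is what the q_scale discussion further down covers.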
Michelvl92 commented 5 months ago

@lix19937 thanks. Do I need to change anything during the quantization process with the NVIDIA pytorch-quantization toolkit (can it influence model accuracy?) before I apply the INT8 input to the TensorRT conversion?

By default, my input TRT graph looks as below:

[graph screenshot: fp32_input]

lix19937 commented 4 months ago

pytorch-quantization needs retraining (fine-tuning), after which you need to evaluate model accuracy.

Michelvl92 commented 4 months ago

I know (already done), but what I mean is: with pytorch-quantization, do I need to configure and/or fine-tune with int8 input as well?

lix19937 commented 4 months ago

No, you do not need to.

Michelvl92 commented 4 months ago

How should I then do the normalization?

lix19937 commented 4 months ago

If your net input is FP32 NCHW, it is automatically reformatted to INT8 NCHW (for an INT8 CNN), so you do not need to do anything.
If you want to improve inference speed, you can fuse your preprocessing (e.g., an int8 BGR image normalized like (x * scale - mean) / std), which can make the engine reformat-free.

Michelvl92 commented 4 months ago

If you want to improve inference speed, you can fuse your preprocessing (e.g., an int8 BGR image normalized like (x * scale - mean) / std), which can make the engine reformat-free.

Do you have an example of how to do this in Python?

lix19937 commented 4 months ago

If you use Python, you can use torchvision's transforms.Normalize(), or use NumPy: https://github.com/NVIDIA/TensorRT/blob/release/10.0/demo/Diffusion/utilities.py#L335
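For reference, a minimal NumPy sketch of that kind of preprocessing (uint8 HWC image to normalized FP32 NCHW); the mean/std values below are placeholder ImageNet statistics, not something prescribed in the thread:

```python
import numpy as np

def preprocess_fp32(img_hwc_uint8: np.ndarray,
                    mean=(0.485, 0.456, 0.406),   # placeholder stats
                    std=(0.229, 0.224, 0.225)) -> np.ndarray:
    """uint8 HWC image -> normalized FP32 NCHW tensor."""
    x = img_hwc_uint8.astype(np.float32) / 255.0                        # scale to [0, 1]
    x = (x - np.asarray(mean, np.float32)) / np.asarray(std, np.float32)
    x = x.transpose(2, 0, 1)[np.newaxis]                                # HWC -> NCHW
    return np.ascontiguousarray(x)
```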

Michelvl92 commented 4 months ago

I get what you mean, but I do not understand how to fuse preprocessing to improve inference in my case.

So, what should I do, and where (without accuracy loss)?

During TRT inference with int8 input, should I only do int8_array = (uint8_array - 128).astype(np.int8)? Or should I also do this during int8 calibration?

And what about uint8 input?

lix19937 commented 4 months ago

If your net input is BCHW format with RGB channels as float32 (0.0-1.0), you do not need to do anything. Let TRT do the reformat (matmul and permute).

BTW, images are usually uint8 BGR data, so I would consider fusing this into the TRT inference.

Michelvl92 commented 4 months ago

If your net input is BCHW format with RGB channels as float32 (0.0-1.0), you do not need to do anything. Let TRT do the reformat (matmul and permute).

Yes, this is currently happening, but the reformat takes "a lot" of time. Furthermore, I need to copy more data (FP32 vs. INT8).

BTW, images are usually uint8 BGR data, so I would consider fusing this into the TRT inference.

Could you provide an example of how to do this?

lix19937 commented 4 months ago

BTW, images are usually uint8 BGR data, so I would consider fusing this into the TRT inference.

1. uint8 input (assume HWC) -->
2. normalize (now in FP32) + permute (HWC -> CHW) + multiply by q_scale -->
3. int8 (CHW) output fed to the network.

Step 2 should be written as a CUDA kernel to improve speed.
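As a concrete reference for those three steps, here is a minimal NumPy sketch (a CUDA kernel would do the same math on the GPU, as noted above). The q_scale, mean, and std values are placeholders; I am assuming q_scale is the ONNX QuantizeLinear scale of the first conv input, so quantizing divides by it (multiply instead if you store the reciprocal):

```python
import numpy as np

def preprocess_int8(img_hwc_uint8: np.ndarray,
                    q_scale: float,                 # input quantization scale (placeholder)
                    mean=(0.485, 0.456, 0.406),     # placeholder normalization stats
                    std=(0.229, 0.224, 0.225)) -> np.ndarray:
    """uint8 HWC image -> quantized INT8 NCHW buffer for an INT8-input engine."""
    # Step 1: uint8 HWC input, promoted to float for the normalization math.
    x = img_hwc_uint8.astype(np.float32) / 255.0
    # Step 2: normalize, permute HWC -> CHW, and apply the quantization scale.
    x = (x - np.asarray(mean, np.float32)) / np.asarray(std, np.float32)
    x = x.transpose(2, 0, 1)[np.newaxis]            # HWC -> NCHW
    q = np.round(x / q_scale)
    # Step 3: saturate to the INT8 range and cast; feed this buffer to the engine.
    return np.clip(q, -128, 127).astype(np.int8)
```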

lix19937 commented 4 months ago

uint8 in --> int8 out removes one type conversion (cast).

Michelvl92 commented 4 months ago

How do I get the q_scale?

lix19937 commented 4 months ago

You can upload your qat.onnx.

lix19937 commented 4 months ago

How do I get the q_scale?

The q_scale comes from the adjacent conv's quantization scale.
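As one way to read that scale from the exported model, a hedged sketch using the onnx package; it assumes per-tensor quantization and that the scale of the QuantizeLinear node attached to the network input is stored as a graph initializer (exports that keep it in a Constant node would need extra handling):

```python
import onnx
from onnx import numpy_helper

def input_q_scale(onnx_path: str) -> float:
    """Return the scale of the QuantizeLinear node fed by the model input."""
    graph = onnx.load(onnx_path).graph
    input_name = graph.input[0].name
    inits = {init.name: numpy_helper.to_array(init) for init in graph.initializer}
    for node in graph.node:
        if node.op_type == "QuantizeLinear" and node.input[0] == input_name:
            return float(inits[node.input[1]])   # second input of QuantizeLinear is y_scale
    raise ValueError("no QuantizeLinear node attached to the network input")

# Example with a hypothetical file name:
# print(input_q_scale("qat_model.onnx"))
```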

jhss commented 1 month ago

@Michelvl92

Hello. Would you provide example code for the following?

By using pytorch-quantization I was able to create TensorRT engine models that are (almost) fully int8 and have lower latencies than equivalent FP16 models.

I want to convert my custom model into a TensorRT 10.5 engine and run int8 inference, but there are few materials I can refer to. I would appreciate it if you could provide the example code.