Michelvl92 opened 5 months ago
You can refer to:
network->getInput(0)->setAllowedFormats(static_cast<TensorFormats>(1 << static_cast<int32_t>(TensorFormat::kLINEAR)));
network->getInput(0)->setType(DataType::kINT8);
@lix19937 thanks for your reply, how would you do this in Python?
formats = 1 << int(tensorrt.TensorFormat.LINEAR)
network.get_input(0).allowed_formats = formats
network.get_input(0).dtype = tensorrt.DataType.INT8
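For reference, a minimal sketch of where those two lines sit in a full ONNX-to-engine build; the file name "model_qat.onnx" and the surrounding flow are illustrative placeholders, not from this thread:

```python
import tensorrt as trt

# Minimal sketch of where the two lines above fit in an ONNX -> engine build.
# "model_qat.onnx" is an illustrative placeholder.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch (the only mode in TensorRT 10)
parser = trt.OnnxParser(network, logger)

with open("model_qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # allow int8 kernels; QAT scales come from the Q/DQ nodes

# The two lines from the reply above: request a linear int8 network input.
network.get_input(0).allowed_formats = 1 << int(trt.TensorFormat.LINEAR)
network.get_input(0).dtype = trt.DataType.INT8

engine_bytes = builder.build_serialized_network(network, config)
```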
@lix19937 thanks, do I need to change anything during the quantization process with the NVIDIA pytorch-quantization toolkit (can it influence model accuracy?) before I apply the INT8 input to the TensorRT conversion?
By default, my input TRT graph looks as below:
pytorch-quantization needs retraining (fine-tuning), which requires you to evaluate model accuracy.
I know (already done), but what I mean is: with pytorch-quantization, do I need to set and/or fine-tune with an int8 input as well?
No, you do not need to.
How should I then do the normalization?
If your net input is fp32 NCHW, it is auto-reformatted to int8 NCHW (for an int8 CNN); you do not need to do anything.
If you want to improve inference speed, you can fuse your preprocessing (e.g. take the int8 BGR image and do some normalization like (x*scale - mean/std) yourself), which can be reformat-free.
Do you have an example of how to do this in Python?
If you use Python, you can use transforms.Normalize(), or use numpy:
https://github.com/NVIDIA/TensorRT/blob/release/10.0/demo/Diffusion/utilities.py#L335
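For example, a plain numpy version of that standard (unfused) fp32 normalization; the mean/std values below are the common ImageNet defaults, used here only as placeholders:

```python
import numpy as np

# Illustrative mean/std (the common ImageNet values); substitute whatever your
# training pipeline used, in the same channel order as your images.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_fp32(img_u8_hwc: np.ndarray) -> np.ndarray:
    """Standard (unfused) preprocessing: uint8 HWC image -> normalized fp32 CHW."""
    x = img_u8_hwc.astype(np.float32) / 255.0  # uint8 -> fp32 in [0, 1]
    x = (x - MEAN) / STD                       # per-channel normalization
    return x.transpose(2, 0, 1)                # HWC -> CHW
```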
I do get what you mean, but I do not understand how to fuse the preprocessing to improve inference in my case.
So, what should I do, and where (without accuracy loss)?
During TRT inference with int8 input, should I only do: int8_array = (uint8_array - 128).astype(np.int8)?
Or should I also do this during int8 calibration?
And what about uint8 input?
If your net input is BCHW format with RGB channels in float32 (0.0-1.0), you do not need to do anything. Let TRT do the reformat (matmul and permute).
BTW, images are usually uint8 BGR data, so I would consider fusing this into the TRT inference.
If your net input is BCHW format with RGB channels in float32 (0.0-1.0), you do not need to do anything. Let TRT do the reformat (matmul and permute).
Yes, this is currently happening, but the reformat takes "a lot" of time. Furthermore, I need to copy more data (FP32 vs INT8).
BTW, images are usually uint8 BGR data, so I would consider fusing this into the TRT inference.
Could you provide an example of how to do this?
BTW, images are usually uint8 BGR data, so I would consider fusing this into the TRT inference.
uint8 as input (assume HWC) -->
do some normalization (now in fp32) + permute (HWC -> CHW) + multiply by q_scale -->
int8 (CHW) output to the network.
Step 2 should be written as a CUDA kernel to improve speed.
uint8 in --> int8 out, which removes one type conversion (CAST).
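As a reference for what that fused step computes (the speed win comes from doing it in a single CUDA kernel, as noted above), here is a numpy sketch; it assumes q_scale is the multiplicative quantization factor of the network input (roughly 127 / amax), matching the "multiply by q_scale" step:

```python
import numpy as np

def fused_preprocess(img_u8_hwc, mean, std, q_scale):
    """Reference (CPU) version of the fused step above:
    uint8 HWC in -> normalize (fp32) + HWC->CHW permute + mul(q_scale) -> int8 CHW out.
    Assumes q_scale is the multiplicative quantization factor of the network
    input (approximately 127 / amax); mean/std are in the image's channel order."""
    x = img_u8_hwc.astype(np.float32) / 255.0    # uint8 -> fp32
    x = (x - mean) / std                         # per-channel normalization
    x = x.transpose(2, 0, 1)                     # HWC -> CHW
    q = np.rint(x * q_scale)                     # scale into the int8 domain
    return np.clip(q, -128, 127).astype(np.int8)
```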
How to get the q_scale?
You can upload qat.onnx .
How to get the q_scale?
The q_scale comes from the adjacent conv's q scale.
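For illustration, a sketch of pulling that scale out of a QAT ONNX export with the onnx package; it assumes the export stores the input QuantizeLinear scale as an initializer (some exports use a Constant node instead), and the file name "qat.onnx" is just a placeholder:

```python
import onnx
from onnx import numpy_helper

# Sketch: read the input quantization scale from a QAT export ("qat.onnx" is a
# placeholder). It is the scale of the QuantizeLinear node feeding the first
# (adjacent) conv, i.e. the node that consumes the graph input.
model = onnx.load("qat.onnx")
graph_input = model.graph.input[0].name
initializers = {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}

for node in model.graph.node:
    if node.op_type == "QuantizeLinear" and node.input[0] == graph_input:
        scale = float(initializers[node.input[1]])  # step size, roughly amax / 127
        print("input scale:", scale, "multiplicative q_scale:", 1.0 / scale)
        break
```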
@Michelvl92
Hello. Would you provide example code for the following?
By using pytorch-quantization I was able to create TensorRT engine models that are (almost) fully int8 and have lower latencies than the FP16 equivalent models.
I want to convert my custom model into a TensorRT 10.5 engine and run int8 inference, but there are few materials I can refer to. I would appreciate it if you could provide the example code.
By using pytorch-quantization I was able to create TensorRT engine models that are (almost) fully int8 and have lower latencies than the FP16 equivalent models.
One of the downsides is that the input is reformatted from FP32 to INT8 for the next 2D Conv. This takes up to 5% of the model's total latency budget. On the other hand, I need to cast my images from uint8/int8 to FP32. As you can see, this is not efficient and introduces double casting, plus transferring 4x more memory to the GPU for inference, which is not needed since the model is fully int8.
Is there a possibility to create engines that have (u)int8 input? And do I need to adapt anything during quantization?