I managed to quantize the model using NVIDIA/pytorch-quantization. From my experiments, the accuracy drop is around 3%, GPU memory usage is only reduced by about 10%, and the speed of the PTQ → TensorRT FP16+INT8 engine is close to that of plain TensorRT FP16 (no PTQ). Personally, I don't think it helps much.