then I run the provided script to remove qdq and save the calib cache and export the quantized trt engine as follows, I also tried on GPU only without using DLA, still no speedup.
However this sparse model does not give any speedup because none of the layers are eligible for sparse math from the following log. But I am sure that the structured sparsity meets the requirements.
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) Layers eligible for sparse math:
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[01/04/2024-18:25:05] [V] [TRT] Total number of generated kernels selected for the engine: 0
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUDNN
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
I have attached my onnx model and trt engine for you to reproduce. Thanks!
Desktop.zip
So my question is: am I doing the sparsity correctly and if not, how to get the claimed speedup from adding structured sparsity?
/usr/src/tensorrt/bin/trtexec --onnx='qat_sparse_864_gpu_noqdq.onnx' --saveEngine=qat_sparse_864_gpu_noqdq.trt' --int8 --fp16 --calib='qat_sparse_864_gpu_precision_config_calib.cache' --profilingVerbosity=detailed --sparsity=force --verbose --allowGPUFallback --useDLACore=0
However this sparse model does not give any speedup because none of the layers are eligible for sparse math from the following log. But I am sure that the structured sparsity meets the requirements.
I have attached my onnx model and trt engine for you to reproduce. Thanks! Desktop.zip
So my question is: am I doing the sparsity correctly and if not, how to get the claimed speedup from adding structured sparsity?