NVIDIA / Deep-Learning-Accelerator-SW

NVIDIA DLA-SW: recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

structured sparsity does not give any speedup on either GPU or DLA #19

Closed: slimwangyue closed this issue 10 months ago

slimwangyue commented 10 months ago
1. First, I applied sparsity as follows:
```python
import torch
from apex.contrib.sparsity import ASP

model_sparse.model.cuda()
optimizer_sparse = torch.optim.AdamW(model_sparse.parameters(), lr=learning_rate, weight_decay=0.05)
# Compute 2:4 masks on the trained weights and hook them into the optimizer
ASP.prune_trained_model(model_sparse, optimizer_sparse)
# Fine-tune with the masks applied, then save the sparse checkpoint
trainer.fit(model=model_sparse, train_dataloaders=train_loader)
torch.save({"state_dict": model_sparse.state_dict()}, '/home/orin-1/yue/TLR/models/model_sparse.ckpt')
```
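
To double-check that ASP really produced the 2:4 pattern before moving on, I scan the weights directly. This is only a minimal sketch: the `check_2to4` helper is mine, and it assumes the groups of 4 run along the input-channel dimension, which is what TensorRT expects for its sparse kernels.

```python
import torch

def check_2to4(module_root):
    # Report, for every Linear/Conv2d weight, whether each group of 4 values
    # along the input-channel dimension has at most 2 non-zeros.
    for name, m in module_root.named_modules():
        if not isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            continue
        w = m.weight.detach()
        if isinstance(m, torch.nn.Conv2d):
            w = w.permute(0, 2, 3, 1)  # KCRS -> KRSC so C is the innermost dim
        if w.shape[-1] % 4:
            continue  # channel count not a multiple of 4, skip
        groups = w.reshape(-1, 4)
        ok = (groups != 0).sum(dim=1).le(2).all().item()
        print(f"{name}: {'2:4 OK' if ok else 'NOT 2:4'}")

check_2to4(model_sparse)
```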
2. Then I reloaded the pruned model from the checkpoint, applied quantization (QAT), and exported the ONNX model as follows:
```python
import torch
from apex.contrib.sparsity import ASP
from pytorch_quantization import nn as quant_nn

def prune_trained_model_custom(model, optimizer, compute_sparse_masks=True):
    # Re-register the 2:4 masks, this time whitelisting the quantized layer types
    asp = ASP()
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[quant_nn.QuantLinear, quant_nn.QuantConv2d], allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()

prune_trained_model_custom(model.model, optimizer_sparse)
model.optimizer = optimizer_sparse
# QAT fine-tuning with the sparsity masks still applied
trainer.fit(model=model, train_dataloaders=train_loader)
# Export the fake-quant nodes as explicit Q/DQ ops in the ONNX graph
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model.model.cuda(), dummy_input.cuda(), "/home/orin-1/yue/TLR/export_fine/qat_sparse_864_gpu.onnx", verbose=False, input_names=input_names, output_names=output_names)
```
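
Since trtexec only sees the weights stored in the ONNX file, I ran the same kind of check on the exported graph. Again, just a minimal sketch: it assumes 4-D initializers are KCRS conv weights and 2-D ones are linear weights, and the same check can be pointed at the noqdq file from step 3, which is what trtexec actually consumes.

```python
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("/home/orin-1/yue/TLR/export_fine/qat_sparse_864_gpu.onnx")
for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if w.ndim == 4:                 # assume a KCRS conv weight: group along C
        w = w.transpose(0, 2, 3, 1)
    elif w.ndim != 2:               # skip biases, scales, shape tensors, ...
        continue
    if w.shape[-1] % 4:
        continue
    groups = w.reshape(-1, 4)
    max_nz = np.count_nonzero(groups, axis=1).max()
    if max_nz > 2:
        print(f"{init.name}: NOT 2:4 sparse (up to {max_nz} non-zeros per group of 4)")
```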
3. Then I ran the provided script to remove the Q/DQ nodes and save the calibration cache, and exported the quantized TensorRT engine as follows. I also tried GPU only, without DLA, and still saw no speedup.

```shell
/usr/src/tensorrt/bin/trtexec --onnx=qat_sparse_864_gpu_noqdq.onnx --saveEngine=qat_sparse_864_gpu_noqdq.trt --int8 --fp16 --calib=qat_sparse_864_gpu_precision_config_calib.cache --profilingVerbosity=detailed --sparsity=force --verbose --allowGPUFallback --useDLACore=0
```

However, this sparse model does not give any speedup: according to the log below, none of the layers are eligible for sparse math. I am confident, though, that the weights meet the 2:4 structured-sparsity requirements.

```
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) Layers eligible for sparse math:
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[01/04/2024-18:25:05] [V] [TRT] Total number of generated kernels selected for the engine: 0
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUDNN
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
```
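
For what it's worth, the built engine can also be inspected programmatically. A minimal sketch with the TensorRT Python API (this assumes the engine was built with --profilingVerbosity=detailed, as in the command above; searching the JSON output for "sparse" kernel names is only a heuristic, and DLA-offloaded layers do not report per-kernel tactics):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("qat_sparse_864_gpu_noqdq.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Per-layer JSON; kernel/tactic names are included when the engine was built
# with detailed profiling verbosity. Grep the output for "sparse" to see
# whether any sparse implementations were actually selected.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```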

I have attached my ONNX model and TRT engine so you can reproduce the issue. Thanks! Desktop.zip

So my question is: am I applying sparsity correctly, and if not, how do I get the claimed speedup from structured sparsity?

nvoliver commented 10 months ago

Closing since this is tracked in https://forums.developer.nvidia.com/t/sparsity-does-not-provide-any-speedup-for-tensorrt-on-dla/278355