NVIDIA / Deep-Learning-Accelerator-SW

NVIDIA DLA-SW: recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

structured sparsity does not give any speedup on either GPU or DLA #19

Closed: slimwangyue closed this issue 10 months ago

slimwangyue commented 10 months ago
1. First, I applied sparsity as follows:
```python
import torch
from apex.contrib.sparsity import ASP

model_sparse.model.cuda()
optimizer_sparse = torch.optim.AdamW(model_sparse.parameters(), lr=learning_rate, weight_decay=0.05)
# Compute 2:4 masks on the trained weights and hook them into the optimizer
ASP.prune_trained_model(model_sparse, optimizer_sparse)
# Fine-tune with the masks applied, then save the sparse checkpoint
trainer.fit(model=model_sparse, train_dataloaders=train_loader)
torch.save({"state_dict": model_sparse.state_dict()}, '/home/orin-1/yue/TLR/models/model_sparse.ckpt')
```
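
To double-check that ASP really produced the 2:4 pattern before moving on, I scan the weights directly. This is only a minimal sketch: the `check_2to4` helper is mine, and it assumes the groups of 4 run along the input-channel dimension, which is what TensorRT expects for its sparse kernels.

```python
import torch

def check_2to4(module_root):
    # Report, for every Linear/Conv2d weight, whether each group of 4 values
    # along the input-channel dimension has at most 2 non-zeros.
    for name, m in module_root.named_modules():
        if not isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            continue
        w = m.weight.detach()
        if isinstance(m, torch.nn.Conv2d):
            w = w.permute(0, 2, 3, 1)  # KCRS -> KRSC so C is the innermost dim
        if w.shape[-1] % 4:
            continue  # channel count not a multiple of 4, skip
        groups = w.reshape(-1, 4)
        ok = (groups != 0).sum(dim=1).le(2).all().item()
        print(f"{name}: {'2:4 OK' if ok else 'NOT 2:4'}")

check_2to4(model_sparse)
```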
2. Then I reloaded the pruned model from the checkpoint, applied quantization (QAT), and exported the ONNX model as follows:
```python
import torch
from apex.contrib.sparsity import ASP
from pytorch_quantization import nn as quant_nn

def prune_trained_model_custom(model, optimizer, compute_sparse_masks=True):
    # Re-register the 2:4 masks, this time whitelisting the quantized layer types
    asp = ASP()
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[quant_nn.QuantLinear, quant_nn.QuantConv2d], allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()

prune_trained_model_custom(model.model, optimizer_sparse)
model.optimizer = optimizer_sparse
# QAT fine-tuning with the sparsity masks still applied
trainer.fit(model=model, train_dataloaders=train_loader)
# Export the fake-quant nodes as explicit Q/DQ ops in the ONNX graph
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model.model.cuda(), dummy_input.cuda(), "/home/orin-1/yue/TLR/export_fine/qat_sparse_864_gpu.onnx", verbose=False, input_names=input_names, output_names=output_names)
```
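
Since trtexec only sees the weights stored in the ONNX file, I ran the same kind of check on the exported graph. Again, just a minimal sketch: it assumes 4-D initializers are KCRS conv weights and 2-D ones are linear weights, and the same check can be pointed at the noqdq file from step 3, which is what trtexec actually consumes.

```python
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("/home/orin-1/yue/TLR/export_fine/qat_sparse_864_gpu.onnx")
for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if w.ndim == 4:                 # assume a KCRS conv weight: group along C
        w = w.transpose(0, 2, 3, 1)
    elif w.ndim != 2:               # skip biases, scales, shape tensors, ...
        continue
    if w.shape[-1] % 4:
        continue
    groups = w.reshape(-1, 4)
    max_nz = np.count_nonzero(groups, axis=1).max()
    if max_nz > 2:
        print(f"{init.name}: NOT 2:4 sparse (up to {max_nz} non-zeros per group of 4)")
```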
3. Then I ran the provided script to remove the Q/DQ nodes and save the calibration cache, and exported the quantized TensorRT engine as follows. I also tried GPU only, without DLA, and still saw no speedup.

```shell
/usr/src/tensorrt/bin/trtexec --onnx=qat_sparse_864_gpu_noqdq.onnx --saveEngine=qat_sparse_864_gpu_noqdq.trt --int8 --fp16 --calib=qat_sparse_864_gpu_precision_config_calib.cache --profilingVerbosity=detailed --sparsity=force --verbose --allowGPUFallback --useDLACore=0
```

However, this sparse model does not give any speedup: according to the log below, none of the layers are eligible for sparse math. I am confident, though, that the weights meet the 2:4 structured-sparsity requirements.

```
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) Layers eligible for sparse math:
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[01/04/2024-18:25:05] [V] [TRT] Total number of generated kernels selected for the engine: 0
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUDNN
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
```
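
For what it's worth, the built engine can also be inspected programmatically. A minimal sketch with the TensorRT Python API (this assumes the engine was built with --profilingVerbosity=detailed, as in the command above; searching the JSON output for "sparse" kernel names is only a heuristic, and DLA-offloaded layers do not report per-kernel tactics):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("qat_sparse_864_gpu_noqdq.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Per-layer JSON; kernel/tactic names are included when the engine was built
# with detailed profiling verbosity. Grep the output for "sparse" to see
# whether any sparse implementations were actually selected.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```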

I have attached my ONNX model and TRT engine so you can reproduce the issue. Thanks! Desktop.zip

So my question is: am I applying sparsity correctly, and if not, how do I get the claimed speedup from structured sparsity?

nvoliver commented 10 months ago

Closing since this is tracked in https://forums.developer.nvidia.com/t/sparsity-does-not-provide-any-speedup-for-tensorrt-on-dla/278355