AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Ampere sparsity #6802

Open · JoeCool90 opened this issue 3 years ago

JoeCool90 commented 3 years ago

Hi, any thoughts on the new sparsity features of the NVIDIA Ampere GPUs? Looks like they could give a big speed improvement where applicable:

A100 whitepaper · GA102 whitepaper

AlexeyAB commented 3 years ago

Even without sparsity, YOLOv4 training will be 6x faster on Ampere, since TF32 is supported on all Ampere GPUs (RTX 3070 – 3090, Tesla A100). And inference will be 2x faster.
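
For context, TF32 needs no changes to the FP32 model itself; in cuBLAS (which darknet uses for GEMM) it is an opt-in math mode on the handle. A minimal sketch, assuming CUDA 11+ and linking against cuBLAS (an illustration, not darknet's actual code):

```c
/* Minimal sketch, assuming CUDA 11+ and cuBLAS: TF32 is an opt-in math
 * mode on the cuBLAS handle. Once set, ordinary FP32 GEMM calls run on
 * Ampere tensor cores in TF32; inputs, outputs, and the rest of the
 * calling code stay FP32. */
#include <stdio.h>
#include <cublas_v2.h>

int main(void) {
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }
    /* Route subsequent cublasSgemm calls through TF32 tensor cores. */
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    /* ... cublasSgemm(...) as usual ... */
    cublasDestroy(handle);
    return 0;
}
```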

And yes, it seems Sparsity can be used for Pruning: just prune (set to zero) the smallest 2 of 4 sequential weight values, and it should speed up inference another 2x (so in total, inference will be 4x faster than on Turing).
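
A minimal sketch of that pruning rule (an illustration, not darknet's actual code): in every consecutive group of four weights, keep the two largest by magnitude and zero the other two.

```c
/* Minimal sketch (not darknet's actual code): enforce 2:4 structured
 * sparsity by zeroing the two smallest-magnitude weights in every
 * consecutive group of four. A trailing group of fewer than four
 * weights is left dense. */
#include <math.h>
#include <stddef.h>

void prune_2_of_4(float *w, size_t n) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
        /* Find the indices of the two largest-magnitude weights,
         * with keep0 the largest and keep1 the second largest. */
        int keep0 = 0, keep1 = 1;
        if (fabsf(w[i + 1]) > fabsf(w[i + 0])) { keep0 = 1; keep1 = 0; }
        for (int j = 2; j < 4; ++j) {
            float a = fabsf(w[i + j]);
            if (a > fabsf(w[i + keep0]))      { keep1 = keep0; keep0 = j; }
            else if (a > fabsf(w[i + keep1])) { keep1 = j; }
        }
        /* Zero the other two (the smallest 2 of 4). */
        for (int j = 0; j < 4; ++j) {
            if (j != keep0 && j != keep1) w[i + j] = 0.0f;
        }
    }
}
```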

https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf

NVIDIA has developed a simple and universal recipe for sparsifying deep neural networks for inference using this 2:4 structured sparsity pattern. The network is first trained using dense weights, then fine-grained structured pruning is applied, and finally the remaining non-zero weights are fine-tuned with additional training steps. This method results in virtually no loss in inferencing accuracy based on evaluation across dozens of networks spanning vision, object detection, segmentation, natural language modeling, and translation.
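
The fine-tuning step of that recipe has to keep the pruned positions at zero. A minimal sketch, assuming a 0/1 mask recorded at pruning time (hypothetical names, not darknet code):

```c
/* Minimal sketch (hypothetical names, not darknet code): during the
 * fine-tuning phase, re-apply the 2:4 mask after every weight update
 * so pruned weights stay exactly zero. */
#include <stddef.h>

void apply_sparsity_mask(float *w, const unsigned char *mask, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (!mask[i]) w[i] = 0.0f;  /* pruned position: force back to zero */
    }
}

/* Usage inside the training loop:
 *   update_weights(w, grad, lr, n);    // ordinary optimizer step
 *   apply_sparsity_mask(w, mask, n);   // keep the 2:4 pattern intact
 */
```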


JoeCool90 commented 3 years ago

Cool. The FP16 performance doesn't seem to be that much better than a 2080 Ti (e.g. this bench on ResNet), but maybe that will change when the TensorRT support comes out? I don't know.