SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022
MIT License

throughput of nat_tiny vs resnet50 #38

Closed jiaojile closed 2 years ago

jiaojile commented 2 years ago

Hi~ I find that two models of similar size, nat_tiny and resnet50, have very different throughput on an NVIDIA GeForce 2080Ti. How do they compare on your machine? [images: time_nat_tiny, time_resnet50] (Please ignore the accuracy in the images; the input is not the ImageNet test set.)

alihassanijr commented 2 years ago

Hello and thank you for your interest. First off, most of our benchmarking was done on A100s, not 2080 Tis, so I wouldn't expect really excellent performance on a 2080 Ti. We basically debugged and developed the kernel on a different architecture.

Secondly, I noticed you're not using mixed precision. Enabling it can really push both models further, though it doesn't have the same effect on different models.

Thirdly, I can confirm that I got around the same throughput on NAT-Tiny with a 2080, around 340 imgs/sec with the default batch size, but I only got 666 imgs/sec with ResNet50, so I'm not sure what's going on there. It's not too surprising though, since that time typically includes I/O overheads, so it might differ a little depending on your setup.

[2 images: benchmark output for the two models]
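To make the point about I/O overheads concrete, here is a minimal sketch of a timing loop that measures only the compute step. It is not the benchmarking script used in this thread; `measure_throughput`, `step`, and `sync` are hypothetical names chosen for illustration, and the sketch assumes the batch is already loaded in memory before timing starts.

```python
import time
from typing import Callable, Optional


def measure_throughput(
    step: Callable[[], None],
    batch_size: int,
    warmup: int = 10,
    iters: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return images/sec for `step`, which should run one forward pass on a
    batch that is already in memory, so data loading is not timed."""
    # Warm-up iterations are not timed: they absorb one-off costs such as
    # kernel selection, caching, and lazy initialization.
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()  # wait for queued asynchronous work before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()  # wait again so the timed region covers all of the work
    elapsed = time.perf_counter() - start

    return batch_size * iters / elapsed
```

With PyTorch on a GPU, `step` would typically wrap a call like `model(x)` under `torch.no_grad()`, and `sync` would be `torch.cuda.synchronize`, since CUDA kernels launch asynchronously and the clock would otherwise stop too early. Numbers obtained this way can be noticeably higher than ones that include the data loader.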

Also, just fyi, this is how they'll run with AMP:

[2 images: benchmark output for the two models with AMP]

Notice how little it affected ResNet50, but how much closer NAT-Tiny now is to ResNet50 in throughput. It's not really AMP's shortcoming either; it's mostly that the modules ResNet50 primarily uses probably have such well-optimized full precision modes that half precision just doesn't end up making that big of a difference.

If your question was why it's slower now in general, let me know and I'll get into details.

jiaojile commented 2 years ago

Thank you for the quick and detailed answer. With mixed precision, throughput on my 2080Ti improves for both models as follows:

NAT-Tiny: from 340 imgs/sec to 550 imgs/sec
ResNet50: from 890 imgs/sec to 1050 imgs/sec

[images: nat_tiny_amp, resnet50_amp]

Despite the small effect on ResNet50, it's still nearly twice as fast as NAT-Tiny. What is the throughput comparison on A100s?
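As a quick sanity check on these figures, the ratios can be worked out directly. This is just arithmetic on the numbers quoted above; the dictionary layout and the `fp32`/`amp` key names are chosen here for illustration.

```python
# Throughput figures (imgs/sec) quoted in this thread for a 2080Ti,
# before ("fp32") and after ("amp") enabling mixed precision.
reported = {
    "NAT-Tiny": {"fp32": 340, "amp": 550},
    "ResNet50": {"fp32": 890, "amp": 1050},
}


def speedup(before: float, after: float) -> float:
    """Ratio of the new throughput to the old one."""
    return after / before


# Gain each model gets from mixed precision.
amp_gain = {name: speedup(r["fp32"], r["amp"]) for name, r in reported.items()}

# How far ahead ResNet50 is in each mode.
gap_fp32 = reported["ResNet50"]["fp32"] / reported["NAT-Tiny"]["fp32"]
gap_amp = reported["ResNet50"]["amp"] / reported["NAT-Tiny"]["amp"]

for name, gain in amp_gain.items():
    print(f"{name}: {gain:.2f}x from mixed precision")
print(f"ResNet50 vs NAT-Tiny: {gap_fp32:.2f}x (fp32), {gap_amp:.2f}x (AMP)")
```

On these numbers, mixed precision gives NAT-Tiny about 1.62x and ResNet50 about 1.18x, and the gap between the two shrinks from about 2.62x to about 1.91x, which matches "nearly twice as fast".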

BTW, have you tried to do INT8 quantization with NVIDIA TensorRT for NAT?

alihassanijr commented 2 years ago

Again, NAT's running similarly on my end, but I'm not sure why I can't run ResNet as fast. It could be for a wide range of reasons: CUDA version, torch version, and so on. Even hardware, to be honest.

Either way, the fact that ResNet50 is ahead is not too surprising. ResNet mainly uses convs, which usually run on either PyTorch's kernels or cuDNN's, both of which are much faster and more optimized than the current version of NAT.

alihassanijr commented 2 years ago

Closing this due to inactivity. If you still have questions feel free to open it back up.