Peak Performance INT1, INT4, INT8, INT16, INT32 for RTX3090 Tensorcore

NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines


Peak Performance INT1, INT4, INT8, INT16, INT32 for RTX3090 Tensorcore #195

Closed by YukeWang96 3 years ago

YukeWang96 commented 3 years ago

Hi,

Is there any reference for the peak Tensor Core performance of INT1, INT4, INT8, INT16, and INT32 on the RTX 3090? I just want to compare my current CUTLASS GEMM against the theoretical peak.

Thanks!

hwu36 commented 3 years ago

See page 44 of https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf

As for running Tensor Cores on the 3090, see this post: https://discuss.tvm.apache.org/t/rfc-byoc-nvidia-cutlass-integration/9147/24?u=hwu36
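
For the comparison itself, a rough sketch of the arithmetic may help: turn a measured GEMM time into achieved TOPS, then divide by the peak figure read from the whitepaper. The function names, the problem size, the timing, and `peak_tops` below are all hypothetical placeholders, not values from the whitepaper or from CUTLASS; substitute your own measurement and the number for your data type.

```python
def gemm_tops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput, in tera-operations per second, for one M x N x K GEMM."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Each of the M*N outputs needs K multiplies and K adds, so 2*M*N*K operations.
    ops = 2.0 * m * n * k
    return ops / seconds / 1e12


def utilization(achieved_tops: float, peak_tops: float) -> float:
    """Fraction of the theoretical peak that was reached (1.0 means 100%)."""
    if peak_tops <= 0:
        raise ValueError("peak_tops must be positive")
    return achieved_tops / peak_tops


if __name__ == "__main__":
    # Hypothetical measurement: an 8192 x 8192 x 8192 GEMM that took 10 ms.
    achieved = gemm_tops(8192, 8192, 8192, 10e-3)

    # PLACEHOLDER ONLY: replace with the peak for your data type from the whitepaper.
    peak_tops = 200.0

    print(f"achieved: {achieved:.1f} TOPS")
    print(f"utilization: {utilization(achieved, peak_tops):.1%}")
```

Note that this counts a multiply-add as two operations, which is the usual convention for quoted peak figures; if your timing includes host-to-device copies or kernel launch overhead, the utilization will look lower than what the kernel alone achieves.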

YukeWang96 commented 3 years ago

Thanks a lot!