NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

Where can I see examples of WMMA GEMM usage for INT1 (bit 1)? #34

Status: Closed (closed by AlexeyAB 5 years ago)

AlexeyAB commented 5 years ago

I can see only tests for INT8 and INT4: https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu


According to this presentation, INT1 (1-bit) can reach 2088 TOPS on the GeForce RTX 2080 Ti (TU102): http://on-demand.gputechconf.com/gtc-il/2018/pdf/sil8140-optimizing-cuda-applications-for-the-volta-turing-gpu-architecture.pdf

https://github.com/NVIDIA/cutlass#whats-new-in-cutlass-11

WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu

From the last newsletter:

CUTLASS 1.2, the latest version of the CUDA template library for linear algebra subroutines, includes the following key updates:

  • Support for Turing Tensor Cores that significantly speedup matrix computations for deep learning inference
  • Tensor Core optimized WMMA GEMMs for the new INT8, INT4, and INT1 precision modes introduced in Turing
  • Support for batched strided GEMMs, parallelized GEMM-K reductions, enhanced utilities, and samples

d-k-b commented 5 years ago

You can see an example in the perf tests at https://github.com/NVIDIA/cutlass/blob/master/tools/test/perf/gemm/wmma_binary_gemm.cu.

d-k-b commented 5 years ago

The implementation is modeled here: https://github.com/NVIDIA/cutlass/blob/ed2ed4d667ce95e1371bd62db32b6a114e774336/tools/util/reference/detail/inner_product.h#L51-L61
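
For readers who do not want to follow the link, here is a minimal host-side sketch of a 1-bit inner product and a reference GEMM built on it. It assumes the operation is XOR followed by a population count, which is the combination Turing's 1-bit Tensor Core instruction exposes. The names `popcount32`, `xor_popc_inner_product`, and `reference_binary_gemm`, as well as the 32-bits-per-word packing and the operand layouts, are illustrative assumptions and not CUTLASS identifiers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count set bits in a 32-bit word (portable, no compiler intrinsics assumed).
inline int popcount32(std::uint32_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Inner product of two 32-element bit vectors: c + popcount(a XOR b).
inline int xor_popc_inner_product(std::uint32_t a, std::uint32_t b, int c) {
    return c + popcount32(a ^ b);
}

// Host reference: C[m][n] += sum over k of popcount(A[m][k] ^ B[n][k]).
// A is M x k_words (row-major); B is stored as N x k_words, so both
// operands are read contiguously along the K dimension (assumed layout).
void reference_binary_gemm(const std::vector<std::uint32_t>& A,
                           const std::vector<std::uint32_t>& B,
                           std::vector<int>& C,
                           std::size_t M, std::size_t N, std::size_t k_words) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            int acc = C[m * N + n];
            for (std::size_t k = 0; k < k_words; ++k) {
                acc = xor_popc_inner_product(A[m * k_words + k],
                                             B[n * k_words + k], acc);
            }
            C[m * N + n] = acc;
        }
    }
}
```

A host reference like this is mainly useful for checking the output of a Tensor Core kernel on small problem sizes.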

AlexeyAB commented 5 years ago

If anyone is interested, I implemented an object-detection neural network, the XNOR-Yolo model (1-bit precision), in the Darknet framework using Tensor Cores: https://github.com/AlexeyAB/darknet/issues/2365#issuecomment-462923756

| Model | RTX 2070 CUDNN_HALF=0, ms | RTX 2070 CUDNN_HALF=1, ms | Speedup X times |
|---|---|---|---|
| yolov3-spp.cfg 608x608, Float-32/16 bit precision | 40.9 | 27.2 (Tensor Cores for floats) | 1.5x |
| yolov3-spp_xnor_obj.cfg.txt 608x608, CC7.5 (Tensor Cores for XNOR), Bit-1 precision | 13.5 | 13.2 | 1.0x |
| Speedup X times | 3.0x | 2.0x | - |

XNOR-net training process: chart_yolov3-spp_xnor_obj
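
As background on how an XNOR-style dot product relates to the XOR plus population count operation discussed in this thread: if each bit encodes a value in {-1, +1} (bit 1 as +1, bit 0 as -1), then matching bits contribute +1 and mismatching bits contribute -1, so the dot product of two K-element vectors is K - 2 * popcount(a XOR b). The sketch below illustrates that identity for vectors of up to 32 elements. The names `count_bits` and `xnor_dot` and the bit encoding are assumptions made for illustration and are not taken from Darknet or CUTLASS.

```cpp
#include <cstdint>

// Count set bits in a 32-bit word (portable helper).
inline int count_bits(std::uint32_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Dot product of two {-1,+1} vectors of length k_bits (1..32), packed one
// element per bit with bit 1 = +1 and bit 0 = -1 (assumed encoding).
inline int xnor_dot(std::uint32_t a, std::uint32_t b, int k_bits) {
    // Ignore any bits above the vector length.
    std::uint32_t mask = (k_bits >= 32) ? 0xFFFFFFFFu : ((1u << k_bits) - 1u);
    int mismatches = count_bits((a ^ b) & mask);
    return k_bits - 2 * mismatches;
}
```

For example, `xnor_dot(0xB, 0xD, 4)` compares the elements (+1, +1, -1, +1) and (+1, -1, +1, +1), which differ in two positions, so the result is 4 - 2 * 2 = 0.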

AlexeyAB commented 5 years ago

@d-k-b Hi,

Is there an approximate date for when a device-wide bin1_t GEMM function that uses Tensor Cores will be available in CUTLASS?