Where can I see examples of WMMA GEMM usage for INT1 (bit 1)?

AlexeyAB commented 6 years ago

Does the CUTLASS 1.2 library really support INT1 (1-bit) GEMM on Tensor Cores, so that we can use it for XNOR neural networks?
Does it perform the XNOR operation !(a^b) instead of multiplication?
Does it compute C[j][i] = popcnt( A_i_row[x] XNOR B_j_col[x] )?
Should we pack every 32 bits into a uint32_t (A along rows, B along columns), in the same manner as in cuDNN, where CUDNN_DATA_INT8x32 and CUDNN_TENSOR_NCHW_VECT_C are needed to use INT8 on Tensor Cores with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM? https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips
Where can I read more about this, and where can I see examples of Warp-Level Matrix Operations (WMMA) GEMM usage for INT1 (1 bit)?
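
For reference, the operation asked about above can be sketched on the CPU. This is only a minimal illustration of a bit-packed XNOR-popcount GEMM, not CUTLASS code: the function names (`pack_bits`, `xnor_popcount_dot`, `xnor_gemm`), the LSB-first packing order, and the zero padding of the last word are all assumptions made for this example. The output is indexed as C[i][j] (row of A, column of B).

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pack 0/1 values into 32-bit words, least significant
// bit first. Unused bits of the last word stay 0.
std::vector<uint32_t> pack_bits(const std::vector<int>& bits) {
    std::vector<uint32_t> words((bits.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) words[i / 32] |= (1u << (i % 32));
    }
    return words;
}

// Number of positions (out of k valid bits) where a and b agree,
// i.e. popcount(a XNOR b) restricted to the valid bits.
int xnor_popcount_dot(const std::vector<uint32_t>& a,
                      const std::vector<uint32_t>& b,
                      std::size_t k) {
    assert(a.size() == b.size() && a.size() == (k + 31) / 32);
    int mismatches = 0;
    for (std::size_t w = 0; w < a.size(); ++w) {
        mismatches += static_cast<int>(std::bitset<32>(a[w] ^ b[w]).count());
    }
    // Padding bits are 0 in both operands, so they never count as mismatches.
    return static_cast<int>(k) - mismatches;
}

// C is m x n, row-major: C[i * n + j] compares packed row i of A with
// packed column j of B, each holding k valid bits.
std::vector<int> xnor_gemm(const std::vector<std::vector<uint32_t>>& a_rows,
                           const std::vector<std::vector<uint32_t>>& b_cols,
                           std::size_t k) {
    const std::size_t m = a_rows.size();
    const std::size_t n = b_cols.size();
    std::vector<int> c(m * n, 0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            c[i * n + j] = xnor_popcount_dot(a_rows[i], b_cols[j], k);
        }
    }
    return c;
}
```

Packing A along rows and B along columns keeps the K dimension contiguous for both operands, so each output element reduces to a run of word-wise XOR and popcount operations.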

I can see only tests for INT8 and INT4: https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu

According to these slides, INT1 (1 bit) can reach 2088 TOPS on GeForce RTX 2080 Ti (TU102): http://on-demand.gputechconf.com/gtc-il/2018/pdf/sil8140-optimizing-cuda-applications-for-the-volta-turing-gpu-architecture.pdf

https://github.com/NVIDIA/cutlass#whats-new-in-cutlass-11

WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu

From the last newsletter:

CUTLASS 1.2, the latest version of the CUDA template library for linear algebra subroutines, includes the following key updates:

Support for Turing Tensor Cores that significantly speedup matrix computations for deep learning inference

Tensor Core optimized WMMA GEMMs for the new INT8, INT4, and INT1 precision modes introduced in Turing

Support for batched strided GEMMs, parallelized GEMM-K reductions, enhanced utilities, and samples

d-k-b commented 6 years ago

You can see an example in the perf tests at https://github.com/NVIDIA/cutlass/blob/master/tools/test/perf/gemm/wmma_binary_gemm.cu.

d-k-b commented 6 years ago

The implementation is modeled here: https://github.com/NVIDIA/cutlass/blob/ed2ed4d667ce95e1371bd62db32b6a114e774336/tools/util/reference/detail/inner_product.h#L51-L61 .
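
As a rough illustration of what a 1-bit inner-product model can look like, here is a sketch that assumes the accumulation is XOR followed by popcount, which is the bit operation the experimental 1-bit WMMA API in CUDA exposes on Turing. It is not a copy of the linked CUTLASS code, and all function names below are hypothetical. Under that assumption, the XNOR count and the dot product of {-1, +1} vectors both follow from the XOR count and the number of valid bits k.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical model of one 32-bit slice: add popcount(a XOR b) to the
// running accumulator c (the count of differing bits).
int xor_popc_inner_product(uint32_t a, uint32_t b, int c) {
    return c + static_cast<int>(std::bitset<32>(a ^ b).count());
}

// Convert a count of differing bits over k valid bits into the count of
// matching bits, i.e. popcount(a XNOR b).
int xnor_count_from_xor(int xor_count, int k) {
    assert(xor_count >= 0 && xor_count <= k);
    return k - xor_count;
}

// Dot product of two {-1, +1} vectors encoded as bits (1 -> +1, 0 -> -1):
// matches contribute +1 and mismatches contribute -1.
int signed_dot_from_xor(int xor_count, int k) {
    assert(xor_count >= 0 && xor_count <= k);
    return k - 2 * xor_count;
}
```

With this convention an XNOR network layer can run the XOR-popcount GEMM unchanged and apply the conversion once per output element, after the full K reduction.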

AlexeyAB commented 5 years ago

If anyone is interested, I implemented a neural network for object detection, the XNOR-Yolo model (1-bit precision), in the Darknet framework using Tensor Cores: https://github.com/AlexeyAB/darknet/issues/2365#issuecomment-462923756

| Model | RTX 2070 `CUDNN_HALF=0`, ms | RTX 2070 `CUDNN_HALF=1`, ms | Speedup X times |
|---|---|---|---|
| yolov3-spp.cfg 608x608, Float-32/16 bit precision | 40.9 | 27.2 (Tensor Cores for floats) | 1.5x |
| yolov3-spp_xnor_obj.cfg.txt 608x608, CC7.5 (Tensor Cores for XNOR), Bit-1 precision | 13.5 | 13.2 | 1.0x |
| Speedup X times | 3.0x | 2.0x | - |

XNOR-net training process: chart_yolov3-spp_xnor_obj

AlexeyAB commented 5 years ago

@d-k-b Hi,

Is there an approximate date for when a device-wide bin1_t GEMM function that uses Tensor Cores will appear in CUTLASS?
