Closed AlexeyAB closed 5 years ago
You can see an example in the perf tests at https://github.com/NVIDIA/cutlass/blob/master/tools/test/perf/gemm/wmma_binary_gemm.cu.
The implementation is modeled here: https://github.com/NVIDIA/cutlass/blob/ed2ed4d667ce95e1371bd62db32b6a114e774336/tools/util/reference/detail/inner_product.h#L51-L61 .
If anyone is interested, I implemented a neural network for object detection, an XNOR-Yolo model (1-bit precision), on the Darknet framework with Tensor Cores: https://github.com/AlexeyAB/darknet/issues/2365#issuecomment-462923756
| Model | RTX 2070 CUDNN_HALF=0, ms | RTX 2070 CUDNN_HALF=1, ms | Speedup X times |
|---|---|---|---|
| yolov3-spp.cfg 608x608, Float 32/16-bit precision | 40.9 | 27.2 (Tensor Cores for floats) | 1.5x |
| yolov3-spp_xnor_obj.cfg.txt 608x608, CC7.5 (Tensor Cores for XNOR), Bit-1 precision | 13.5 | 13.2 | 1.0x |
| Speedup X times | 3.0x | 2.0x | - |
XNOR-net training process:
@d-k-b Hi,
Are there any approximate dates for when the device-wide `bin1_t` GEMM function that uses Tensor Cores will appear in CUTLASS?
1. Does the CUTLASS 1.2 library really support INT1 (1-bit) GEMM using Tensor Cores, so that we can use it for XNOR neural networks?
2. Does it perform XNOR `!(a^b)` operations instead of multiply? That is, does it compute `C[j][i] = popcnt( A_i_row[x] XNOR B_j_col[x] )`?
3. Should we pack each 32 bits into a `uint32_t` (A along rows, B along columns) in the same manner as in cuDNN, where we should use `CUDNN_DATA_INT8x32` and `CUDNN_TENSOR_NCHW_VECT_C` to use INT8 on Tensor Cores with `CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM`? https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips
4. Where can I read more about this, and where can I see examples of warp-level matrix operation (WMMA) GEMM usage for INT1 (1 bit)? I can see only tests for INT8 and INT4: https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu
As written here, we can achieve 2088 TOPS for INT1 (1 bit) on a GeForce RTX 2080 Ti (TU102): http://on-demand.gputechconf.com/gtc-il/2018/pdf/sil8140-optimizing-cuda-applications-for-the-volta-turing-gpu-architecture.pdf
From the last newsletter: https://github.com/NVIDIA/cutlass#whats-new-in-cutlass-11