Closed: agnonchik closed this issue 6 months ago
Titan Z is actually 2 Kepler GPUs on 1 board. Our kernel was developed on a single Maxwell GPU (GTX 750). The different architecture and layout might explain the drop in performance.
Our XNOR kernel is currently shared-memory bound. To relieve the shared-memory bottleneck, I would suggest:
1) implement some 8x8 register tiling
2) minimize shared memory bank conflicts
3) for a Maxwell GPU, try to keep 256 active threads per thread block and >= 2 active thread blocks per SM.
You might want to make different adjustments for Kepler, Pascal and Volta GPUs.
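The register-tiling idea in (1) can be sketched in plain Python (the real kernel is CUDA, so this is only a conceptual model; all names here are mine, not from the repository). Each "thread" computes an 8x8 tile of C entirely in local accumulators, so every value loaded from A or B is reused 8 times before being discarded, which is what relieves pressure on shared memory:

```python
# Conceptual sketch of 8x8 register tiling (illustrative only; the actual
# kernel is CUDA). One "thread" owns an 8x8 tile of C = A @ B and keeps its
# partial sums in local variables, standing in for GPU registers.
TILE = 8

def tiled_matmul(A, B):
    M, N = len(A), len(A[0])   # A is MxN
    K = len(B[0])              # B is NxK; dims assumed multiples of TILE
    C = [[0.0] * K for _ in range(M)]
    for i0 in range(0, M, TILE):          # tile row (one per "thread")
        for j0 in range(0, K, TILE):      # tile column
            acc = [[0.0] * TILE for _ in range(TILE)]  # per-thread "registers"
            for k in range(N):
                a = [A[i0 + i][k] for i in range(TILE)]  # 8 loads from A
                b = [B[k][j0 + j] for j in range(TILE)]  # 8 loads from B
                for i in range(TILE):
                    for j in range(TILE):
                        acc[i][j] += a[i] * b[j]  # 64 FMAs per 16 loads
            for i in range(TILE):
                for j in range(TILE):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C
```

The point of the sketch is the ratio in the inner loop: 64 multiply-accumulates per 16 loads, versus 2 loads per multiply-accumulate in a naive kernel.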
Here is a tutorial on how to write blazing fast GEMM for Maxwell GPUs: https://github.com/NervanaSystems/maxas/wiki/SGEMM
Hi Matthieu, thanks for your suggestions!
I found that the efficiency of the XNOR kernel, in its current implementation, strongly depends on the matrix dimensions. If the kernel computes C = AB, where A and B are MxN and NxK matrices respectively, the MLP performs three matrix-matrix multiplications at run time:

M, N, K
10000, 4096, 4096
10000, 4096, 4096
10000, 4096, 10
For the first two multiplications, XNOR beats cuBLAS with 0.065 s against 0.103 s. For the third multiplication, XNOR is slower, taking 0.017 s against the 0.003 s taken by cuBLAS.
Some low-level libraries optimized for binary operations would solve the issue.
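For context, the core trick an XNOR GEMM relies on can be sketched in plain Python (this is a conceptual sketch, not the repository's CUDA kernel; `pack` and `xnor_dot` are illustrative names of mine). Binarized +/-1 values are packed into the bits of a machine word, and the inner dot product becomes a single XOR plus a popcount:

```python
# Conceptual XNOR inner product (illustrative, not the actual CUDA kernel).
# For two +/-1 vectors of length n packed into integers, their dot product
# equals n - 2 * popcount(a XOR b): matching bit pairs contribute +1,
# mismatching pairs contribute -1.

def pack(bits):
    """Pack a +/-1 vector into an int: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for v in bits:
        word = (word << 1) | (1 if v > 0 else 0)
    return word

def xnor_dot(wa, wb, n):
    """Dot product of two packed +/-1 vectors of length n."""
    return n - 2 * bin(wa ^ wb).count("1")

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, 1]
assert xnor_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```

This also hints at why the K=10 case is slow: with such a narrow output matrix there is almost no work per packed word to amortize the kernel's fixed overhead.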
Thanks for the explanation!
Your XNOR Conv2d does not use any fast multiplication algorithms (like Winograd), so it just cannot be faster.
I've got the following benchmarking results for the kernels on a TITAN Z GPU: Baseline - 2.642 s, Theano - 0.582 s, XNOR - 0.988 s :-( You can see that Theano is faster than XNOR, which is not consistent with the main point of the referenced paper. Do you have any idea why this happens? How can XNOR be tweaked to beat Theano?
P.S.: matrix multiplication alone gives: GEMM - 2.788 s, cublasSgemm - 0.331 s, XNOR GEMM - 0.182 s, which is quite OK.