MatthieuCourbariaux / BinaryNet

Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

MLP with XNOR kernel is slower than theano.tensor.dot on MNIST dataset #24

Closed: agnonchik closed this issue 6 months ago

agnonchik commented 6 years ago

I've got the following benchmarking results for the MLP kernels on a TITAN Z GPU:

- Baseline: 2.642 s
- Theano: 0.582 s
- XNOR: 0.988 s :-(

You can see that Theano is faster than XNOR, which is not consistent with the main point of the referenced paper. Do you have any idea why this happens? How can XNOR be tweaked to beat Theano?

P.S.: Matrix multiplication alone gives: GEMM 2.788 s, cublasSgemm 0.331 s, XNOR GEMM 0.182 s, which is quite OK.
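For anyone reading along, here is a minimal sketch of the idea behind the XNOR GEMM as I understand it from the paper: weights and activations are packed 32 per `unsigned int`, and the inner product becomes XNOR plus popcount. The kernel name is a placeholder and the shared-memory tiling of the real kernel is omitted, so this is an illustration rather than the repo's implementation:

```cuda
// Naive sketch of a binary (XNOR) GEMM. A and B are assumed to be already
// bit-packed along the reduction dimension: each unsigned int holds 32
// {+1,-1} values encoded as {1,0}, and Npack = N / 32 (assumes N % 32 == 0).
__global__ void xnor_gemm_naive(const unsigned int* A,  // M x Npack, row-major
                                const unsigned int* B,  // Npack x K, row-major
                                float* C,               // M x K, row-major
                                int M, int Npack, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= K) return;

    int matches = 0;
    for (int i = 0; i < Npack; ++i) {
        // XNOR compares 32 bit-pairs at once; __popc counts the matches.
        matches += __popc(~(A[row * Npack + i] ^ B[i * K + col]));
    }
    // Map bit counts back to a {+1,-1} dot product over 32*Npack values:
    // matches contribute +1, mismatches -1.
    C[row * K + col] = (float)(2 * matches - 32 * Npack);
}
```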

MatthieuCourbariaux commented 6 years ago

Titan Z is actually two Kepler GPUs on one board. Our kernel was developed on a single Maxwell GPU (a GTX 750). The difference in architecture and layout might explain the drop in performance.

Our XNOR kernel is currently shared-memory bound. To relieve the shared-memory bottleneck, I would suggest the following (a rough sketch of the first point follows below):

1. Implement some 8x8 register tiling.
2. Minimize shared memory bank conflicts.
3. On a Maxwell GPU, try to keep 256 active threads per thread block and >= 2 active thread blocks per SM.

You might want to make different adjustments for Kepler, Pascal and Volta GPUs.
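As a rough illustration of point 1, here is a hypothetical register-tiled variant of a binary GEMM. It is not our actual kernel: shared-memory staging and bank-conflict padding are omitted, and the tile size is just the 8x8 suggested above. The point is that each packed word loaded from A or B is reused 8 times from registers instead of being re-fetched from shared memory:

```cuda
#define TILE 8

// Each thread owns an 8x8 tile of C, held in 64 register accumulators.
// Boundary handling is skipped: dimensions are assumed padded to the
// block tile size. Same bit-packing convention as the naive sketch.
__global__ void xnor_gemm_regtile(const unsigned int* A,  // M x Npack
                                  const unsigned int* B,  // Npack x K
                                  float* C,               // M x K
                                  int M, int Npack, int K)
{
    // Top-left corner of this thread's 8x8 output tile.
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
    if (row0 + TILE > M || col0 + TILE > K) return;  // assumes padded dims

    int acc[TILE][TILE] = {{0}};

    for (int i = 0; i < Npack; ++i) {
        unsigned int a_frag[TILE], b_frag[TILE];
        #pragma unroll
        for (int r = 0; r < TILE; ++r) a_frag[r] = A[(row0 + r) * Npack + i];
        #pragma unroll
        for (int c = 0; c < TILE; ++c) b_frag[c] = B[i * K + col0 + c];
        // 16 loads feed 64 XNOR-popcount accumulations.
        #pragma unroll
        for (int r = 0; r < TILE; ++r)
            #pragma unroll
            for (int c = 0; c < TILE; ++c)
                acc[r][c] += __popc(~(a_frag[r] ^ b_frag[c]));
    }
    #pragma unroll
    for (int r = 0; r < TILE; ++r)
        #pragma unroll
        for (int c = 0; c < TILE; ++c)
            C[(row0 + r) * K + (col0 + c)] = (float)(2 * acc[r][c] - 32 * Npack);
}
```

With a 16x16 thread block (256 threads), each block then covers a 128x128 tile of C, which lines up with the occupancy suggestion in point 3.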

Here is a tutorial on how to write blazing fast GEMM for Maxwell GPUs: https://github.com/NervanaSystems/maxas/wiki/SGEMM

agnonchik commented 6 years ago

Hi Matthieu, thanks for your suggestions!

I found that the efficiency of the XNOR kernel, in its current implementation, strongly depends on the matrix dimensions. If the kernel computes C = AB, where A and B are MxN and NxK matrices respectively, the MLP performs three matrix-matrix multiplications at run time:

| M | N | K |
|---|---|---|
| 10000 | 4096 | 4096 |
| 10000 | 4096 | 4096 |
| 10000 | 4096 | 10 |

For the first two multiplications, XNOR beats cuBLAS: 0.065 s against 0.103 s. For the third multiplication, XNOR is slower, taking 0.017 s against the 0.003 s taken by cuBLAS, presumably because the 10000x10 output is too narrow for the kernel's fixed tile width, so most threads in each block do no useful work.

A low-level library with kernels optimized for binary operations across different matrix shapes would solve the issue.
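Short of such a library, one stopgap would be to dispatch on shape at run time: use the XNOR kernel for the large hidden-layer products and fall back to cuBLAS for the narrow output layer. A hedged host-side sketch, assuming the packed buffers and the naive kernel sketched earlier in this thread; the threshold is a guess to be tuned by benchmarking, not a measured constant:

```cuda
#include <cublas_v2.h>

// Hypothetical shape-based dispatch. Afloat/Bfloat are the unpacked
// +/-1 matrices for cuBLAS; Apack/Bpack are the bit-packed versions
// used by xnor_gemm_naive (assumes N % 32 == 0). All row-major.
void gemm_dispatch(cublasHandle_t handle,
                   const float* Afloat, const float* Bfloat,
                   const unsigned int* Apack, const unsigned int* Bpack,
                   float* C, int M, int N, int K)
{
    const int kMinDim = 64;  // assumed tuning threshold, not measured
    if (M >= kMinDim && N >= kMinDim && K >= kMinDim) {
        dim3 block(16, 16);
        dim3 grid((K + 15) / 16, (M + 15) / 16);
        xnor_gemm_naive<<<grid, block>>>(Apack, Bpack, C, M, N / 32, K);
    } else {
        // cuBLAS is column-major; computing B^T * A^T = (A*B)^T in
        // column-major yields row-major C without extra transposes.
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    K, M, N, &alpha,
                    Bfloat, K, Afloat, N, &beta, C, K);
    }
}
```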

MatthieuCourbariaux commented 6 years ago

Thanks for the explanation!

MaratZakirov commented 6 years ago

Your XNOR Conv2d does not use any of the fast convolution algorithms (like Winograd), so it just cannot be faster.