cooooorn / Pytorch-XNOR-Net

XNOR-Net with binary GEMM and binary conv2d kernels, supporting both CPU and GPU.
BSD 3-Clause "New" or "Revised" License

XNOR acceleration #8

Open wqrray opened 5 years ago

wqrray commented 5 years ago

Thanks for the CUDA and PyTorch implementation of XNOR-Net; it really helps me. I'm now wondering whether the implementation can actually speed up training. After some experiments on MNIST, Bin_LeNet seems slower than LeNet, which seems unreasonable. Can you explain how to accelerate the training process? Thanks a lot.

flyingpot commented 5 years ago

@wqrray I met this problem too. After testing the VGG19 and LeNet models, only Bin_VGG19 is faster than its float counterpart, and only when not using CUDA. Bin_LeNet is slower whether CUDA is used or not. I want to know why.

@cooooorn Could you tell me the theoretical speedup ratio between Bin_Net and the original network?

wqrray commented 5 years ago

@flyingpot I have checked the Bin_LeNet code and found that it only uses XNOR in the test part, which means XNOR is not used during training. I'm now trying to change the code to use XNOR in training as well, but some dimension problems in the backward pass are troubling me. Do you plan to change the code so that XNOR is used in both the forward and backward passes during training?

flyingpot commented 5 years ago

@wqrray I think XNOR-Net aims to make the testing phase faster and the model smaller. Since float numbers are still used in the training phase, the speedup there may not be large.

In my experiments, Bin_LeNet is slower than LeNet even in the testing phase. I don't know why.

wqrray commented 5 years ago

@flyingpot According to the paper, the authors use XNOR in both the forward and backward passes to accelerate training, so I am trying to implement that. I also see that Bin_LeNet is not much faster than LeNet; their speeds are about the same. I guess the extra steps of binarization and computing the scaling factor alpha take some time. By the way, do you know why we need to divide by 32 in the BinConv2d layer? self.weight = nn.Parameter(torch.IntTensor(out_channels, 1 + (in_channels * self.kernel_size[0] * self.kernel_size[1] - 1) // 32))

flyingpot commented 5 years ago

@wqrray Yeah, you are right. But I think the backward pass is much slower than the forward pass, so the speedup may not be good enough for training.

The author uses integers to store the binary weights, and one integer can hold 32 bits. So this line allocates space for the binary weights used in the testing phase.
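For illustration, here is a minimal PyTorch sketch of that bit packing; the function and names are hypothetical, not code from this repo. Each group of 32 binarized weights (+1 becomes bit 1, -1 becomes bit 0) is stored in one 32-bit word, which is exactly what the 1 + (n - 1) // 32 sizing allows for.

import torch

# Hypothetical helper, for illustration only: pack +1/-1 weights into 32-bit words.
def pack_binary_weights(w_float):
    # w_float: (out_channels, n) real-valued weights
    out_channels, n = w_float.shape
    n_words = 1 + (n - 1) // 32                 # ceil(n / 32), matching the BinConv2d sizing
    bits = (w_float >= 0).to(torch.int64)       # 1 encodes +1, 0 encodes -1
    packed = torch.zeros(out_channels, n_words, dtype=torch.int64)
    for j in range(n):
        packed[:, j // 32] |= bits[:, j] << (j % 32)
    return packed                               # the repo stores such words in an IntTensor

# Example: 3 * 5 * 5 = 75 weights per filter -> ceil(75 / 32) = 3 words per filter
w = torch.randn(16, 75)
print(pack_binary_weights(w).shape)             # torch.Size([16, 3])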

wqrray commented 5 years ago

@flyingpot Thank you for answering these questions! I now wonder whether XNOR acceleration can really be achieved. I also tried Torch (Lua) based on the author's code, but since I had problems adding a new layer in Torch, I am now trying PyTorch. I am starting to doubt whether the 58x acceleration claimed in the paper can really be reached.

cooooorn commented 5 years ago

If your PyTorch version is 0.4.0 or higher, the speed will be much slower than with version 0.3.1 due to the change in '.data' semantics.

In general, the GPU kernel is slower during the forward pass than the non-binarized model, which uses cuBLAS for matrix multiplication.

It is very difficult for me to optimize the CUDA code so that this kernel runs as fast as cuBLAS; this was my first time writing CUDA code, although I have written a lot of C++.

According to the Binarized Neural Networks paper, the theoretical Nvidia GPU speed-up is a factor of 32/6 ≈ 5.3.

However, the CPU kernel is about 2x faster than PyTorch v0.3.1 during the forward pass, which is more meaningful for devices with limited computing power.

By the way, I tried Intel's SIMD instructions (SSE4.2, AVX2), but they unexpectedly ran slower than the 'asm popcnt' version. (Maybe AVX-512 could fix this? I don't know.)
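For reference, here is a plain-Python illustration (not the repo's C++/CUDA kernel) of the XNOR + popcount trick that binary GEMM builds on: for +1/-1 vectors packed one value per bit, the dot product becomes n - 2 * popcount(a XOR b), so a single popcount covers a whole machine word of multiply-accumulates.

def binary_dot(a_bits, b_bits, n):
    # a_bits, b_bits: n binarized values packed one per bit (bit = 1 means +1, 0 means -1)
    mismatches = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")   # popcount of the XOR
    return n - 2 * mismatches                                         # matches minus mismatches

# a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]  (element 0 is the least significant bit)
a_bits = 0b1101
b_bits = 0b1011
print(binary_dot(a_bits, b_bits, 4))   # 0, same as 1*1 + (-1)*1 + 1*(-1) + 1*1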

flyingpot commented 5 years ago

@cooooorn As you said, the binarized model is 2x faster than the non-binarized one. However, the original XNOR-Net paper says: "With the current generation of CPUs, we can perform 64 binary operations in one clock of CPU." I would like to know how the 2x acceleration is achieved in your code, i.e. where the acceleration happens. Does the binarized multiplication happen in the dgemm_micro_kernel function? And is it possible to push the acceleration ratio higher on CPU? Thank you!

cooooorn commented 5 years ago

@flyingpot Send me your qq by email, if you want to know more details about the implementation.

lucamocerino commented 5 years ago

Hi guys, I benchmarked the code on both GPU and CPU with PyTorch 0.4, but the fp32 model is still faster than the binarized one in test mode. How is that possible!?

kaivu1999 commented 5 years ago

Hi @cooooorn

I am also working on getting a real speed-up from XNOR on CPU or GPU. Can you say what speed-up can be achieved at inference compared with full precision? You followed the XNOR-Net implementation, right? (Image from their paper attached for reference.) Can you tell me about the multiplication operations (output_2D) x K x alpha in the code? Thank you!
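For what it's worth, here is a PyTorch sketch of the scaling factors alpha and K as defined in the XNOR-Net paper, which is what the (output_2D) x K x alpha step corresponds to; the function and variable names below are illustrative, not the repo's actual code.

import torch
import torch.nn.functional as F

# Illustrative only: alpha and K as defined in the XNOR-Net paper.
def xnor_scaling(inputs, weights, stride=1, padding=0):
    # inputs: (N, C, H, W) real activations; weights: (O, C, kh, kw) real filters
    O, C, kh, kw = weights.shape
    alpha = weights.abs().mean(dim=(1, 2, 3))            # per-filter scale, shape (O,)
    A = inputs.abs().mean(dim=1, keepdim=True)           # channel-wise mean of |input|, (N, 1, H, W)
    k = torch.full((1, 1, kh, kw), 1.0 / (kh * kw))      # averaging kernel
    K = F.conv2d(A, k, stride=stride, padding=padding)   # (N, 1, H_out, W_out)
    return alpha, K

# The binary convolution output (N, O, H_out, W_out) is then rescaled roughly as:
#   out = binary_conv(sign(inputs), sign(weights)) * K * alpha.view(1, -1, 1, 1)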

fallingstar62 commented 2 years ago

@flyingpot Send me your qq by email, if you want to know more details about the implementation.

I would like to learn more about the details in matmul.h. Here is my QQ: 958326896. Thanks.

cooooorn commented 2 years ago

https://www.mathematik.uni-ulm.de/~lehn/apfel/sghpc/gemm/page02/index.html


fallingstar62 commented 2 years ago


Thanks, but I'm still confused about the following: [screenshot]