OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Is it possible to use asimd with half-precision float on armv8a or neon on armv7a for accelerating? #1181

Closed. aswywy closed this issue 7 years ago.

aswywy commented 7 years ago

Hi, I'm using OpenBLAS for neural networks on mobile with ARM. I use Caffe (caffe1) with OpenBLAS and it works well, but it is not quick enough. It seems that neural networks do not need full-precision float operations; half precision is also fine, and even 8-bit precision is OK. Is it possible to use ASIMD with half-precision float on armv8a, or NEON on armv7a, to accelerate primitive operations such as convolution and matrix multiplication? Thanks very much.

martin-frbg commented 7 years ago

There is currently no support for half-precision floats as far as I am aware. This was discussed quite some time ago, though only in the context of the Intel F16C instruction set, where I believe the consensus was that the drawbacks would outweigh any benefits (see #694).

brada4 commented 7 years ago

ASIMD is the same: the only half-precision instruction is one that expands it to single precision. The only benefit would come if memory bandwidth is the sole bottleneck. Is the neural network using SGEMM? What input sizes?

aswywy commented 7 years ago

Thanks for the reply, @martin-frbg @brada4. But I found a ComputeLibrary on GitHub maintained by ARM: https://github.com/ARM-software/ComputeLibrary

Here is the low-precision GEMM operation:

https://github.com/ARM-software/ComputeLibrary/blob/master/src/core/NEON/kernels/NEGEMMLowpMatrixMultiplyKernel.cpp

I think it takes advantage of the NEON SIMD mechanism and does the multiplication in low-precision float.

As @brada4 said:

> ASIMD is the same: the only half-precision instruction is one that expands it to single precision.

Is there any official document that supports this argument?

@brada4 I do not use a specific NN. I found that in the FPGA area, people cut down the precision of the network parameters to save storage memory and bandwidth, and also to save hardware logic area. I also found that ARM's NEON SIMD mechanism can put several 16-bit floats into a vector and add them with just one instruction. So I wonder if it is possible to use ASIMD to put several 16-bit floats into a vector and compute the GEMM with fewer SIMD instructions, to speed up NN prediction. The final goal is to use the NN on a mobile phone.

ashwinyes commented 7 years ago

> I think it takes advantage of the NEON SIMD mechanism and does the multiplication in low-precision float.

No, it does not. It multiplies integers, not floats. All the NEON intrinsics used there are of the form _s32, _s16, etc., which means they operate on signed integers. If they were floats, they would be of the form _f32, _f64, etc.
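For illustration only (not code from either library), this is what that naming convention looks like in practice, assuming a compiler that ships `<arm_neon.h>`: the _s16/_s32 intrinsics operate on signed-integer lanes, while the _f32 intrinsics operate on single-precision float lanes.

```c
#include <arm_neon.h>

/* Integer path, as in the NEGEMMLowp kernel family: the _s16/_s32 suffixes
 * mean signed-integer lanes. vmull_s16 widens 16-bit products to 32 bits. */
int32x4_t mul_int16(int16x4_t a, int16x4_t b)
{
    return vmull_s16(a, b);     /* 4 x (int16 * int16) -> 4 x int32 */
}

/* Float path: the _f32 suffix means single-precision float lanes. */
float32x4_t mul_f32(float32x4_t a, float32x4_t b)
{
    return vmulq_f32(a, b);     /* 4 x (float * float) -> 4 x float */
}
```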

> ASIMD is the same: the only half-precision instruction is one that expands it to single precision.

This is not completely correct. Armv8.2-A has added half-precision float support (https://community.arm.com/processors/b/blog/posts/armv8-a-architecture-evolution). But I don't think that, as of today, you will be able to find any Armv8 processor on the market which supports Armv8.2-A.

On processors implementing specifications earlier than Armv8.2-A, I think you will have to load the half-precision floats, convert them to single precision using FCVTL instructions, do the operations, and convert back using FCVTN instructions.
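A minimal sketch of that load/widen/compute/narrow pattern, assuming AArch64 with `<arm_neon.h>` and __fp16 storage support; haxpy_sketch is a made-up name, not an OpenBLAS routine. The vcvt_f32_f16 and vcvt_f16_f32 intrinsics map to the FCVTL and FCVTN instructions mentioned above.

```c
#include <arm_neon.h>
#include <stddef.h>

/* AXPY-style loop with fp16 storage: data stays half precision in memory,
 * but all arithmetic is done in single precision after widening. */
void haxpy_sketch(size_t n, float alpha, const __fp16 *x, __fp16 *y)
{
    float32x4_t va = vdupq_n_f32(alpha);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vcvt_f32_f16(vld1_f16(x + i)); /* FCVTL: fp16 -> fp32 */
        float32x4_t vy = vcvt_f32_f16(vld1_f16(y + i));
        vy = vfmaq_f32(vy, va, vx);                     /* fused multiply-add in fp32 */
        vst1_f16(y + i, vcvt_f16_f32(vy));              /* FCVTN: fp32 -> fp16 */
    }
    for (; i < n; i++)                                  /* scalar tail */
        y[i] = (__fp16)((float)y[i] + alpha * (float)x[i]);
}
```

As brada4 notes below, the extra conversions roughly double the load/store work, which is why this gives little speedup unless memory bandwidth is the bottleneck.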

brada4 commented 7 years ago

There are basically no time savings once you double the load/store work (see the Cortex-A57 Software Optimization Guide). @aswywy - if you don't know the matrix size, nobody can help you measure whether a wrong threading threshold hurts you. I don't see any realistic scenario where your FPGA ends up in a significant share of customer mobile phones. OpenCL, for example, is more likely to have half-precision float support, but that's not omnipresent either.

aswywy commented 7 years ago

Thanks a lot @ashwinyes @brada4, I learnt a lot. @ashwinyes I made a mistake: NEGEMMLowpMatrixMultiplyKernel.cpp is aimed at "Matrix Multiplication with Quantized matrices", as in the following link:

http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/Batched-BLAS-2017/talk12-gurney.pdf

Is that right? Does OpenBLAS support the same functions?

As @brada4 said:

> if you don't know the matrix size, nobody can help you measure whether a wrong threading threshold hurts you.

I know it is important to know the matrix size when optimizing the NN algorithm, but I have not yet decided which network to use. What I want is to do object detection on a mobile phone at speed. Do you have any suggestions?

BTW, I do not understand what "wrong threading threshold" means. Can you give an example?

As @brada4 said:

> OpenCL, for example, is more likely to have half-precision float support, but that's not omnipresent either.

This reminded me to look at the OpenCL implementations in the ARM Compute Library:

https://github.com/ARM-software/ComputeLibrary/blob/master/src/core/CL/kernels/CLGEMMMatrixMultiplyKernel.cpp

It seems to accept F16 as input: ARM_COMPUTE_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(input0, 1, DataType::F16, DataType::F32);

@brada4 I do not want to use an FPGA with mobile phones; I just want to follow the method that FPGAs use.

Thanks again!

ashwinyes commented 7 years ago

> https://github.com/ARM-software/ComputeLibrary/blob/master/src/core/CL/kernels/CLGEMMMatrixMultiplyKernel.cpp
> It seems to accept F16 as input: ARM_COMPUTE_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(input0, 1, DataType::F16, DataType::F32);

Yes, it looks like it supports the F16 input type. But it doesn't look like it has any ASIMD (NEON) implementation for that.

aswywy commented 7 years ago

@ashwinyes:

> Yes, it looks like it supports the F16 input type. But it doesn't look like it has any ASIMD (NEON) implementation for that.

Yes, I know : ) It is OpenCL code, another kind of SIMD : )

Thanks for replying : )

brada4 commented 7 years ago

But you are 100% permitted to use the ARM Compute Library once you get over the fact that it will not use ASIMD. Also, you only need _GEMM, and not any of the 50-some other BLAS functions.

aswywy commented 7 years ago

@brada4 Thanks. I think OpenBLAS is my first choice; in fact, I compiled Caffe with OpenBLAS on my mobile. I ran SSD (https://github.com/weiliu89/caffe/tree/ssd) on a phone with an armv7 Snapdragon 801, and for a 1280*720 picture it takes about 4 seconds to find objects. That is too slow, so I got curious about how to accelerate NN prediction. I'm new to GEMM, the core of NN prediction, so I came here looking for help. It is very kind of you, and I learnt a lot.
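For context, a minimal sketch of why GEMM is "the core of NN prediction": after Caffe's im2col step, one convolution layer reduces to a single single-precision GEMM call of roughly the shape below. The function name and dimension layout are illustrative assumptions, not code from Caffe or OpenBLAS.

```c
#include <cblas.h>

/* One convolution layer after im2col, expressed as SGEMM:
 *   M = number of output channels
 *   N = output_height * output_width
 *   K = input_channels * kernel_h * kernel_w
 * A (M x K) holds the filter weights, B (K x N) the im2col'd input patches,
 * and C (M x N) receives the layer output. */
void conv_as_gemm(int M, int N, int K,
                  const float *weights, const float *im2col_buf, float *output)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, weights, K,
                im2col_buf, N,
                0.0f, output, N);
}
```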

martin-frbg commented 7 years ago

There is a semi-abandoned "optimized for deep learning" branch that appears to contain just this simple change to the GEMM workload splitting between threads; perhaps it is useful: https://github.com/xianyi/OpenBLAS/commit/92058a75

brada4 commented 7 years ago

You could try to find the optimal number of threads, since Caffe runs GEMM on smaller and smaller subsets to sniff out features. Current OpenBLAS has a single threshold between one thread and all threads, and anything that falls just past that threshold may not be ideal for machine learning (interface/gemm.c is the place where you can adjust the weights for threading).
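For illustration, the kind of single-threshold heuristic described above looks roughly like the sketch below; the cutoff constant and function names are placeholders, not verbatim code from interface/gemm.c.

```c
/* Simplified illustration of a single-threshold threading heuristic:
 * if the total work M*N*K is below some cutoff, run single-threaded;
 * otherwise hand the problem to all available threads. */
#define GEMM_THREAD_CUTOFF (65536.0 * 4)   /* placeholder cutoff on M*N*K */

static int choose_nthreads(int m, int n, int k, int max_threads)
{
    double mnk = (double)m * (double)n * (double)k;
    if (mnk <= GEMM_THREAD_CUTOFF)
        return 1;            /* small problem: threading overhead dominates */
    return max_threads;      /* large problem: use every core */
}
```

For the small, repeated GEMMs typical of NN inference, a cutoff tuned for dense linear algebra can leave you just past the threshold, paying thread startup and synchronization costs for little gain.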

aswywy commented 7 years ago

@martin-frbg @brada4 Thanks, I will try. I will post the result if I get one : )