OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

optimizing sgemm_kernel routine in openblas #599

Open vipin387 opened 9 years ago

vipin387 commented 9 years ago

We profiled nnet-latgen-faster and found that 40% of the execution time is spent in sgemm_kernel. We optimized the routine with ARM intrinsics and also used the -vectorize and -O3 compilation options in Makefile.arm, but there was no reduction in the execution time of the sgemm_kernel routine; we verified this with timestamps. When we tested the same options on a sample standalone application, however, the execution time dropped by a factor of three. Any help would be appreciated.

xianyi commented 9 years ago

What's the input matrix size in nnet-latgen-faster? Is it the same as in your standalone application?

vipin387 commented 9 years ago

In the standalone application the matrices are A = (500x500), B = (500x500) and C = (500x500), and the call is sgemm_kernel(500,500,500, 0.5, A, B, C, 500); sgemm_kernel is called only once during the whole run. When integrated with the speech engine, the same sgemm_kernel routine is called many times, depending on the number of frames in the speech data, but with smaller dimensions. The logs look similar to this, with bm, bn, bk and ldc varying between calls:

(sgemm_kernel,bm=90,bn=6,bk=180.=360 ldc)
(sgemm_kernel,bm=90,bn=36,bk=180.=360 ldc)

It is called almost 927 times for an audio file of 294 frames. All of these calls together take around 80 ms, and that figure stays the same with the -mfpu=neon, -vectorize, -O3 and -fast-math compilation options and the ARM NEON intrinsics, whereas in the standalone application the time came down from 675 ms to 180 ms for the 500x500 case. It would be good if some light could be thrown on this behaviour.
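
To put the two workloads side by side, here is a rough sketch in C that counts the arithmetic in a single call. It assumes the usual 2*m*n*k convention for a matrix multiply-accumulate and treats the logged bm, bn and bk as m, n and k; both are assumptions, and the helper names gemm_flops and gflops_rate are hypothetical, not part of OpenBLAS.

```c
#include <assert.h>

/* Hypothetical helper: floating-point operations in one m x n x k
 * matrix multiply-accumulate, counted as 2*m*n*k (one multiply and
 * one add per inner-product term). This convention is an assumption. */
double gemm_flops(long m, long n, long k)
{
    assert(m > 0 && n > 0 && k > 0);
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Hypothetical helper: achieved rate in GFLOP/s for `flops`
 * operations completed in `ms` milliseconds. */
double gflops_rate(double flops, double ms)
{
    assert(ms > 0.0);
    return flops / (ms * 1.0e-3) / 1.0e9;
}
```

Under these assumptions, gemm_flops(500, 500, 500) is 250,000,000 operations, while gemm_flops(90, 6, 180) is 194,400 and gemm_flops(90, 36, 180) is 1,166,400. A single speech-engine call therefore does hundreds to over a thousand times less arithmetic than the standalone call. For the standalone timings, gflops_rate(gemm_flops(500, 500, 500), 675.0) is roughly 0.37 GFLOP/s and the same work in 180 ms is roughly 1.39 GFLOP/s. The mix of shapes across the 927 calls is not given in the logs above, so no total is computed for the speech-engine case.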