Reference-LAPACK / lapack

LAPACK development repository

Vectorization of LAPACK routines #1003

Closed by AjaySingh40 2 months ago

AjaySingh40 commented 3 months ago

I am new to LAPACK and would like to know whether LAPACK routines can be vectorized the way BLAS routines are. If so, it would be great if someone could give me an idea of how to go about it. Thank you.

pradeeptrgit commented 3 months ago

AMD's LAPACK implementation, AOCL-LAPACK, has vectorized implementations of a few LAPACK functions. See the links below.
Repo: https://github.com/amd/libflame
Directory containing vectorized code: https://github.com/amd/libflame/tree/master/src/lapack/x86

AjaySingh40 commented 3 months ago

Thank you for your response. Is there any update on vectorization for ARM?

pradeeptrgit commented 3 months ago

I am not aware of any ARM-based vectorization.

AjaySingh40 commented 3 months ago

I went through some of the LAPACK routines, and they rely on BLAS routines for the basic linear algebra operations. LAPACK mostly has wrapper routines that call other routines. Given this, can we conclude that vectorizing BLAS will do the job and that it doesn't make sense to vectorize LAPACK routines (since vectorization pays off when we do heavy or complex computations)? Correct me if I am wrong. Thank you.

langou commented 3 months ago

> I went through some of the LAPACK routines, and they rely on BLAS routines for the basic linear algebra operations. LAPACK mostly has wrapper routines that call other routines. Given this, can we conclude that vectorizing BLAS will do the job and that it doesn't make sense to vectorize LAPACK routines (since vectorization pays off when we do heavy or complex computations)? Correct me if I am wrong.

Yes, this is the idea. You use an optimized BLAS library under LAPACK, and you should get decent performance from LAPACK thanks to that BLAS library being tuned, vectorized, and parallelized for your architecture.

Note that, in addition to using an optimized BLAS, you might also want to tune the block sizes that LAPACK uses.

In any case, "if you use an optimized BLAS with LAPACK, you will get good performance" is the paradigm under which LAPACK has been built. This paradigm does not always hold, but it has proven to be a good rule of thumb for writing sustainable and portable software.
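To make the paradigm concrete, here is a minimal sketch of how a blocked factorization is organized. It is not LAPACK's code: it is a simplified LU factorization that omits the row pivoting LAPACK's own LU routine performs, so it is only suitable for matrices that are safe to factor without pivoting. The function name `lu_blocked_nopivot` and the parameter `nb` (the block size) are hypothetical names chosen for illustration. The point is that steps 2 and 3 are matrix-matrix operations, which is where an optimized BLAS would do the work, and `nb` is the kind of block size that can be tuned.

```c
#include <assert.h>
#include <stddef.h>

/* In-place blocked LU factorization WITHOUT pivoting of a row-major n x n
 * matrix with leading dimension lda. On exit the strict lower triangle holds
 * L (unit diagonal implied) and the upper triangle holds U.
 * Returns 0 on success, or j + 1 if the pivot in column j is exactly zero.
 * Illustrative sketch only; nb is a hypothetical block-size parameter. */
int lu_blocked_nopivot(double *a, size_t n, size_t lda, size_t nb)
{
    assert(a != NULL && nb > 0 && lda >= n);
    for (size_t k = 0; k < n; k += nb) {
        size_t b = (n - k < nb) ? n - k : nb;  /* width of this panel */
        size_t e = k + b;                      /* one past the panel  */

        /* 1. Unblocked factorization of the tall panel A[k:n, k:e]. */
        for (size_t j = k; j < e; ++j) {
            double pivot = a[j * lda + j];
            if (pivot == 0.0)
                return (int)(j + 1);
            for (size_t i = j + 1; i < n; ++i) {
                a[i * lda + j] /= pivot;
                for (size_t c = j + 1; c < e; ++c)
                    a[i * lda + c] -= a[i * lda + j] * a[j * lda + c];
            }
        }

        /* 2. Unit lower triangular solve for the row block A[k:e, e:n]
         *    (a triangular-solve style, matrix-matrix operation). */
        for (size_t j = k; j < e; ++j)
            for (size_t i = j + 1; i < e; ++i)
                for (size_t c = e; c < n; ++c)
                    a[i * lda + c] -= a[i * lda + j] * a[j * lda + c];

        /* 3. Trailing update A[e:n, e:n] -= L21 * U12 (a matrix-matrix
         *    multiply; most of the arithmetic happens here). */
        for (size_t i = e; i < n; ++i)
            for (size_t p = k; p < e; ++p) {
                double l = a[i * lda + p];
                for (size_t c = e; c < n; ++c)
                    a[i * lda + c] -= l * a[p * lda + c];
            }
    }
    return 0;
}
```

Calling `lu_blocked_nopivot(a, n, n, nb)` with different values of `nb` gives the same factors; only the split between the panel work in step 1 and the matrix-matrix work in steps 2 and 3 changes.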

The statement "it doesn't make sense to vectorize LAPACK routines" is a tad too strong. It can make sense in some cases to vectorize some LAPACK routines. This reference LAPACK library does not do this, though. There are some optimized LAPACK libraries where some intrinsic LAPACK routines (in addition to the BLAS ones) are vectorized, tuned for the architecture, etc., and this does bring some performance. It matters for some specific problems.

First, try to find an optimized LAPACK library for your architecture. If you find one, that's probably a good starting point.

If there is no optimized LAPACK library for your architecture, use this reference LAPACK library with an optimized BLAS and see if you get decent performance. If not, try to tune the block size or change the VARIANTS of the algorithm in LAPACK. If that still does not help, then yes, you can do some profiling to see whether you can gain performance by vectorizing some specific intrinsic LAPACK routines. But it can also be the case that the performance you observe is what it is, and that improving it would require new algorithms, new ideas, etc.
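One rough way to judge whether performance is "decent" is to convert a measured run time into an achieved rate and compare it with what the optimized BLAS reaches on a plain matrix multiply on the same machine. The sketch below assumes the usual leading-order operation counts (2mnk for a matrix multiply, 2n^3/3 for an LU factorization, n^3/3 for a Cholesky factorization) and ignores lower-order terms. The helper names are hypothetical and not part of LAPACK.

```c
#include <assert.h>

/* Leading-order floating-point operation counts (lower-order terms ignored). */
double gemm_flops(double m, double n, double k) { return 2.0 * m * n * k; }
double getrf_flops(double n) { return 2.0 / 3.0 * n * n * n; }  /* LU */
double potrf_flops(double n) { return n * n * n / 3.0; }        /* Cholesky */

/* Achieved rate in GFLOP/s for a run that took `seconds` of wall time. */
double gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}
```

For example, `gflops(getrf_flops(n), t)` for a factorization of order `n` that took `t` seconds can be set against `gflops(gemm_flops(n, n, n), t_gemm)`; a large gap suggests looking at block sizes or profiling before concluding that a routine needs hand vectorization.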

AjaySingh40 commented 3 months ago

Thanks @langou, will try it.

AjaySingh40 commented 3 months ago

I tried running different LAPACK examples on top of a BLAS library implemented with SVE. I get an error in the LAPACKE_ssyev() routine when the size is greater than 64 (N > 64). If I increase the size to 1000 or more, it gives a segmentation fault, for both symmetric and unsymmetric matrices. The same examples run without any issue with the BLAS built without SVE. What could be the cause? Output:

Time measured: 0.042 seconds.
The algorithm failed to compute eigenvalues.
LAPACKE_ssyev (row-major, high-level) Example Program Results for 65

Thank you.
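When narrowing down a failure like this, it helps to print the integer that LAPACKE_ssyev() returns rather than only a generic message. The sketch below classifies that value following the documented convention for this routine: 0 is success, a negative value flags an illegal argument, and a positive value means the algorithm failed to converge. The two memory-error values are assumed to match `LAPACK_WORK_MEMORY_ERROR` and `LAPACK_TRANSPOSE_MEMORY_ERROR` in `lapacke.h` and should be checked against the header in use. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed to mirror LAPACK_WORK_MEMORY_ERROR and
 * LAPACK_TRANSPOSE_MEMORY_ERROR from lapacke.h; verify against your header. */
enum { WORK_MEMORY_ERROR = -1010, TRANSPOSE_MEMORY_ERROR = -1011 };

typedef enum {
    SYEV_OK,              /* info == 0 */
    SYEV_BAD_ARGUMENT,    /* info < 0: argument number -info is illegal */
    SYEV_NO_CONVERGENCE,  /* info > 0: the algorithm failed to converge */
    SYEV_OUT_OF_MEMORY    /* the LAPACKE wrapper could not allocate memory */
} syev_status;

/* Map the return value of LAPACKE_ssyev() to a coarse category. */
syev_status classify_syev_info(int info)
{
    if (info == 0)
        return SYEV_OK;
    if (info == WORK_MEMORY_ERROR || info == TRANSPOSE_MEMORY_ERROR)
        return SYEV_OUT_OF_MEMORY;
    return (info < 0) ? SYEV_BAD_ARGUMENT : SYEV_NO_CONVERGENCE;
}

/* Print a one-line diagnostic that includes the raw info value. */
void report_syev_info(FILE *out, int info)
{
    assert(out != NULL);
    switch (classify_syev_info(info)) {
    case SYEV_OK:
        fprintf(out, "ssyev: success\n");
        break;
    case SYEV_BAD_ARGUMENT:
        fprintf(out, "ssyev: argument %d is illegal\n", -info);
        break;
    case SYEV_NO_CONVERGENCE:
        fprintf(out, "ssyev: failed to converge (info = %d)\n", info);
        break;
    case SYEV_OUT_OF_MEMORY:
        fprintf(out, "ssyev: memory allocation failed (info = %d)\n", info);
        break;
    }
}
```

A typical use would be `report_syev_info(stderr, LAPACKE_ssyev(LAPACK_ROW_MAJOR, 'V', 'U', n, a, lda, w));`. Running the same input against both BLAS builds and comparing the reported values shows whether the two builds already diverge at N = 65.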