Reference-LAPACK / lapack

LAPACK development repository
Other
1.46k stars 430 forks source link

Introduce cpp (and also adds lasr3 and steqr3) #991

Open thijssteel opened 4 months ago

thijssteel commented 4 months ago

related to issue #988

This PR introduces some C++ code to LAPACK.

The general design for the C++ code is:

A lot of things are introduced in this PR:

Directory structure

This PR adds some new folders:

- lapack_cpp
   - example
   - include
      - lapack_c
      - lapack_cpp 
   - src
      - lapack_c
      - lapack_cpp
      - lapack_fortran 

Because we implement the cpp code using templates, they are all defined in headers, which reside in the include folder. The src folder mostly contains wrappers.

There are three main routines taking part in this PR: lasr3, rot, and lartg.

This directory structure is not necessarily optimal. Perhaps it is better to put the headers and source files in the same directory for easy navigation (it is trivial for build tools to copy them to an include directory)

Building

Currently, the example can be built using GNU Make, but ideally, there would be a proper build which also includes the newly added lasr3 algorithm in the lapack binary, both using GNU Make and Cmake.

Workspace

An inevitable discussion for any LAPACK variant is workspace. I chose the following design:

Matrix class

A matrix/vector is just lightweight wrapper around a pointer and a few integers defining its leading dimension and size. Because it knows its size, it can internally check for out of bounds access, with much more precision than something like valgrind ever could. The following code

real A(50,50)
call srot( 100, A(1, 1), lda, A(2, 1), lda, c, s) 

would fail inside srot. Not only could the matrix class have detected such a mistake at its source, it is actually impossible to write the mistake.

Matrix A(50, 50, memoryblock);
rot(A.row(1), A.row(2), c, s); 

A question that was posed is: why don't we use mdspan or Eigen? Eigen matrices own their data, making it a bit more difficult to make fortran wrappers. The expression template system and fixed size matrices, while great for many applications are also extra features that would take effort to fully support and I don't think it is appropriate for LAPACK. mdspan is a closer match to our matrix design, but it again has extra features that I don't want to have to support. An mdspan matrix can have a more general layout that just column/row major. Because it is also a thin wrapper around some data, it is trivial to convert between the two matrix types (so long as they have a compatible stride), so users using mdspan should be able to use the cpp interface without much effort.

About optimized BLAS/LAPACK

Linking the C++ code to optimized BLAS or even LAPACK can achieved through the c/fortran wrappers. But vendors could eventually choose to just provide their own binaries for those wrappers, essentially linking our C++ with their C/C++ code, without going through Fortran.

About the algorithm

Why would we care about efficiently applying givens rotations? Essentially, because of eigenvalues. The implicit QR version of the SVD and symmetric QR algorithm do not achieve high flop rates, mainly because of the lack of an efficient method to apply givens rotations. The routine to calculate a Hessenberg-triangular reduction achieves high efficiency by accumulating blocks of rotations into an orthogonal matrix and then using gemm, but that involves extra flops (not just to do the accumulation, but also because the orthogonal matrix for kk rotations, is usually of size (2(k+1))(2(k+1)), applying that to a (2(k+1))n matrix takes just `6kknflops using rotations, but approximately8kk*n` flops using gemm.

Naively applying the rotations using a triple for loop and rot, achieves about 3 GFlop/s when applying from the left, and 10 GFlops/s when applying from the right. This implementation achieves about 40 GFlop/s in double precision for both sides on my laptop (intel i7-1255U), which is very close to the serial gemm rate (57 GFlops). Also note that the theoretical peak of this routine is only 75% of the peak gemm rate, because the ratio of additions and multiplications is not 1.

Final note

I expect (or hope) a lot of people will have comments on whether we should use C++ and on my implementation. To keep the discussions easy to follow, I suggest that conceptual comments on whether or not we should use C++ are directed to issue #998, comments on the specific way I introduced C++ and my implementation of the algorithm should be posted here.

christoph-conrads commented 4 months ago

comments on the specific way I introduced C++ and my implementation of the algorithm should be posted here.

C++11 introduced the auto keyword. Throughout the code you use the C++98 pattern though, e.g.,

const idx_t n_timings = 100;

In C++11 style I would write:

const auto n_timings = idx_t{100};

The C++11 style avoids all coercions and non-representable values cause compiler errors. With C++98 style non-representable value are silently truncated/rounded/etc; GCC 12 with -Wextra -Wall prints no warnings whereas Clang 14 prints warnings when this happens. For this reason and due to the more uniform syntax in combination with the using keyword (using UC = unsigned char;) I prefer the C++11 style. [The C++11 style does not work though when the type name contains a space like unsigned char.]

What are your thoughts about this?

christoph-conrads commented 4 months ago

comments on the specific way I introduced C++ and my implementation of the algorithm should be posted here.

I can'tquite wrap my head around the file names, specifically the repeated cpp in src/lapack_cpp/rot_cpp.cpp and the mixed language indicator in src/lapack_c/rot_c.f90.

I drafted a CMake build in christoph-conrads/introduce-cpp that you may be interested in.

thijssteel commented 4 months ago

I can'tquite wrap my head around the file names, specifically the repeated cpp in src/lapack_cpp/rot_cpp.cpp and the mixed language indicator in src/lapack_c/rot_c.f90

The file names are mostly to make them unique, so that someone exporting all the .o files to a single obj folder does not run into issues.

The lapack_c defines all the c interfaces, it just happens to be a lot easier to define those interfaces in fortran than in c. But perhaps the name can be improved.

thijssteel commented 4 months ago

C++11 introduced the auto keyword. Throughout the code you use the C++98 pattern though

I was not familiar with that style. My opinion on it is mixed. I can try to start using it, but i worry about enforcing language specific things on that level may waste a lot of time. I also feel that way about move vs refernce. On the other hand, if we start from a good example, it may not be a large cost

thijssteel commented 2 months ago

I've found a pretty significant issue with this PR. I rely on template instantiation in the cpp files, but that may fail if the original function was inlined. I am not entirely sure how to get around that.