Introduce cpp (and also adds lasr3 and steqr3)

thijssteel commented 4 months ago

related to issue #988

This PR introduces some C++ code to LAPACK.

The general design for the C++ code is:

C wrappers on top of existing fortran code
C++ wrappers on top of those C wrappers
C++ code for the new algorithm and its tests (note, those tests are not automated, so definitely need improving)
C wrappers around the new C++ code
Fortran wrappers around those C wrappers so current users can still use a familiar API.

A lot of things are introduced in this PR:

Vector and Matrix classes, these are thin wrappers around a pointer and a size, allowing things like verbose selection of submatrices, columns, ..., all with bounds checks for every step.
LASR3, an algorithm which efficiently applies a sequence of plane rotations to a matrix. (Note, this applies multiple sequences, making it more a level 3 routine than lasr, which is more a level 2 routine, hence the name) This algorithm is described in the paper: "Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance" F. G. Van Zee et al. I have made a modification of this algorithm which packs the matrix A into an aligned workspace for efficiency. This means we also achieve high performance for matrices that are not aligned and even for access patterns that are normally not efficient (applying a sequence from the left on a column major matrix).
STEQR3, a variant of STEQR (QR algorithm for symmetric eigenvalues) which uses LASR3 to efficiently apply the rotations.

Directory structure

This PR adds some new folders:

- lapack_cpp
   - example
   - include
      - lapack_c
      - lapack_cpp 
   - src
      - lapack_c
      - lapack_cpp
      - lapack_fortran

Because we implement the cpp code using templates, they are all defined in headers, which reside in the include folder. The src folder mostly contains wrappers.

There are three main routines taking part in this PR: lasr3, rot, and lartg.

lasr3 is written in C++, the main code is in include/lapack_cpp/lapack/lasr3.hpp. The wrappers to call this function from c are defined in include/lapack_c/lasr3.h and src/lapack_c/lasr3.cpp. The fortran wrappers are defined in src/lapack_fortran/<sdcz>lasr3.f90
rot has a full C++ implementation, defined in include/lapack_cpp/blas/rot.hpp, but this is just the default implementation. Through c wrappers defined in include/lapack_c/rot.h, src/lapack_c/rot.cpp, and src/lapack_cpp/rot.cpp, it actually uses the fortran implementation for supported types if USE_FORTRAN_BLAS is set.
lartg does not have a full C++ implementation. The header defined in include/lapack_cpp/lapack/lartg.hpp is empty. The actual implementation is provided through the interfaces with fortran, defined in src/lapack_cpp/lartg.cpp, src/lapack_c/rot.f90, and include/lapack_c/rot.h

This directory structure is not necessarily optimal. Perhaps it is better to put the headers and source files in the same directory for easy navigation (it is trivial for build tools to copy them to an include directory)

Building

Currently, the example can be built using GNU Make, but ideally, there would be a proper build which also includes the newly added lasr3 algorithm in the lapack binary, both using GNU Make and Cmake.

Workspace

An inevitable discussion for any LAPACK variant is workspace. I chose the following design:

Algorithms do not internally allocate memory. While a few allocations means nothing compared to the factorization of a large matrix, imagine if every larf call internally allocated a vector.
The fortran interface keeps the old lwork = -1 workspace query design.
The cpp interface has a separate function for the workspace query. For lasr3, this is lasr3_workquery, which returns an int, representing the required workspace size.
The cpp interface stores its workspace in a MemoryBlock, but this argument is optional. If it is not provided, a workspace query is performed, a MemoryBlock is internally allocated and provided to the full routine. By default, this workspace is allocated using aligned_alloc, just for a tiny bit of extra performance.

Matrix class

A matrix/vector is just lightweight wrapper around a pointer and a few integers defining its leading dimension and size. Because it knows its size, it can internally check for out of bounds access, with much more precision than something like valgrind ever could. The following code

real A(50,50)
call srot( 100, A(1, 1), lda, A(2, 1), lda, c, s)

would fail inside srot. Not only could the matrix class have detected such a mistake at its source, it is actually impossible to write the mistake.

Matrix A(50, 50, memoryblock);
rot(A.row(1), A.row(2), c, s);

A question that was posed is: why don't we use mdspan or Eigen? Eigen matrices own their data, making it a bit more difficult to make fortran wrappers. The expression template system and fixed size matrices, while great for many applications are also extra features that would take effort to fully support and I don't think it is appropriate for LAPACK. mdspan is a closer match to our matrix design, but it again has extra features that I don't want to have to support. An mdspan matrix can have a more general layout that just column/row major. Because it is also a thin wrapper around some data, it is trivial to convert between the two matrix types (so long as they have a compatible stride), so users using mdspan should be able to use the cpp interface without much effort.

About optimized BLAS/LAPACK

Linking the C++ code to optimized BLAS or even LAPACK can achieved through the c/fortran wrappers. But vendors could eventually choose to just provide their own binaries for those wrappers, essentially linking our C++ with their C/C++ code, without going through Fortran.

About the algorithm

Why would we care about efficiently applying givens rotations? Essentially, because of eigenvalues. The implicit QR version of the SVD and symmetric QR algorithm do not achieve high flop rates, mainly because of the lack of an efficient method to apply givens rotations. The routine to calculate a Hessenberg-triangular reduction achieves high efficiency by accumulating blocks of rotations into an orthogonal matrix and then using gemm, but that involves extra flops (not just to do the accumulation, but also because the orthogonal matrix for kk rotations, is usually of size (2(k+1))(2(k+1)), applying that to a (2(k+1))n matrix takes just `6kknflops using rotations, but approximately8kk*n` flops using gemm.

Naively applying the rotations using a triple for loop and rot, achieves about 3 GFlop/s when applying from the left, and 10 GFlops/s when applying from the right. This implementation achieves about 40 GFlop/s in double precision for both sides on my laptop (intel i7-1255U), which is very close to the serial gemm rate (57 GFlops). Also note that the theoretical peak of this routine is only 75% of the peak gemm rate, because the ratio of additions and multiplications is not 1.

Final note

I expect (or hope) a lot of people will have comments on whether we should use C++ and on my implementation. To keep the discussions easy to follow, I suggest that conceptual comments on whether or not we should use C++ are directed to issue #998, comments on the specific way I introduced C++ and my implementation of the algorithm should be posted here.

christoph-conrads commented 4 months ago

comments on the specific way I introduced C++ and my implementation of the algorithm should be posted here.

C++11 introduced the auto keyword. Throughout the code you use the C++98 pattern though, e.g.,

const idx_t n_timings = 100;

In C++11 style I would write:

const auto n_timings = idx_t{100};

The C++11 style avoids all coercions and non-representable values cause compiler errors. With C++98 style non-representable value are silently truncated/rounded/etc; GCC 12 with -Wextra -Wall prints no warnings whereas Clang 14 prints warnings when this happens. For this reason and due to the more uniform syntax in combination with the using keyword (using UC = unsigned char;) I prefer the C++11 style. [The C++11 style does not work though when the type name contains a space like unsigned char.]

What are your thoughts about this?

christoph-conrads commented 4 months ago

comments on the specific way I introduced C++ and my implementation of the algorithm should be posted here.

I can'tquite wrap my head around the file names, specifically the repeated cpp in src/lapack_cpp/rot_cpp.cpp and the mixed language indicator in src/lapack_c/rot_c.f90.

I drafted a CMake build in christoph-conrads/introduce-cpp that you may be interested in.

thijssteel commented 4 months ago

I can'tquite wrap my head around the file names, specifically the repeated cpp in src/lapack_cpp/rot_cpp.cpp and the mixed language indicator in src/lapack_c/rot_c.f90

The file names are mostly to make them unique, so that someone exporting all the .o files to a single obj folder does not run into issues.

The lapack_c defines all the c interfaces, it just happens to be a lot easier to define those interfaces in fortran than in c. But perhaps the name can be improved.

thijssteel commented 4 months ago

C++11 introduced the auto keyword. Throughout the code you use the C++98 pattern though

I was not familiar with that style. My opinion on it is mixed. I can try to start using it, but i worry about enforcing language specific things on that level may waste a lot of time. I also feel that way about move vs refernce. On the other hand, if we start from a good example, it may not be a large cost

thijssteel commented 2 months ago

I've found a pretty significant issue with this PR. I rely on template instantiation in the cpp files, but that may fail if the original function was inlined. I am not entirely sure how to get around that.

Reference-LAPACK / lapack