Reference-LAPACK / lapack

LAPACK development repository
Other
1.46k stars 430 forks source link

xLARFGP unnecessarily overflows #938

Open christoph-conrads opened 8 months ago

christoph-conrads commented 8 months ago

Description

xLARFGP computes a Householder reflector. Given an input vector x of dimension m, xLARFGP computes H = I - β vv^* such that

934 attempted to fix an overflow when 0 ≪ ‖x(2:m)‖ ≪ x(1) ≪ 1 by settings x(2:m) to zero under certain conditions. This does not resolve the issue. For example, let m = 2 and x = [ε^4, 2 ε^5], then an overflow takes place in the following code when 1 / ALPHA is computed:

https://github.com/Reference-LAPACK/lapack/blob/ae9c818530c6a63cf82f8829d53e8db6b43c40a7/SRC/slarfgp.f#L224

From the point of view of mathematics, this line should simply compute

(See, e.g., Matrix Computations, 4th edition, Algorithm 5.1.1). Obviously the divisions cannot overflow and now the problem source becomes evident: the computation of the reciprocal of ALPHA. The use of, e.g., SLASCL in the case of INCX = 1 resolves the problem.

Minimal example demonstrating the problem (compile with cc -Wextra -Wall -std=c99 -pedantic -llapack -lm):

#include <math.h>
#include <stddef.h>
#include <stdio.h>

typedef int lapack_int;

void slarfgp_(
    lapack_int* n,
    float* alpha,
    float* x,
    lapack_int* incx,
    float* tau);

#define N 2

int main()
{
    float nan = NAN;
    float eps = ldexpf(1.0f, -23);
    lapack_int n = N;
    float x[N] = { powf(eps, 4.0f), 2.0f * powf(eps, 5.0f) };
    lapack_int incx = 1;
    float tau = nan;

    printf("input=[%.2e; %.2e]\n", x[0], x[1]);

    slarfgp_(&n, x + 0, x + 1, &incx, &tau);

    printf("output=[%.2e; %.2e]\n", x[0], x[1]);
    printf("tau=%.2e\n", tau);

    if(isinf(x[1])) {
        fprintf(stderr, "x[1] is infinite!\n");
    }
}

Checklist

langou commented 8 months ago

Hi @christoph-conrads, would SRC/srscl.f do the job too? @langou

langou commented 8 months ago

I did some digging and the issue with LARFGP are mentioned in https://www.netlib.org/lapack/bug_list.html#_strong_span_class_green_bug0037_span_strong_xlarfp_and_scaling which reads:

bug0037 :: xLARFP and scaling CORRECTED - see svn 754, 755 bug report sent by Pat Quillen on Wed 29 Jul 2009 to "lapack@cs.utk.edu". "xLARFP and scaling"

And then

bugXXXX Revert DLARFG to 3.1.0, move DLARFG to DLARFP and improved DLARFP CORRECTED - see svn (r706, r707, r708, r710, r712, r713) + (r755) Bug reported by Pat Quillen (MathWorks) The Householder reflector generation routine dlarfg (for double precision) has been changed in LAPACK-3.2 by dlarfp to generate nonnegative diagonal elements (on the diagonal of the R factor of a QR factorization). There is an issue in the dlarfp subroutine in LAPACK-3.2. The bug report is from Pat Quillen from MathWorks, have participated in the bug fix: Pat Quillen, Jim Demmel, Sven Hammarling, Mark Hoemmen, Guillaume Revy, William Kahan, Julie Langou and Julien Langou. Extensive work by Jim Demmel, Guillaume Revy, and William Kahan have revealed that the new Householder reflector that guarantees a positive diagonal of R in QR has two apparently unfixable drawbacks compared to the old reflector: 1) the computed reflector can be less orthogonal by a factor of about 4, (Jim, Guillaume and William are not sure why.) 2) this new routine is intrinsically more sensitive to over/underflow, this is now well understood, details are skipped here. None of these issues arise with the conventional Householder reflector. So the decision is to

  1. Fix as much as possible the xLARFP routines. This was corrected by Julien (see (r706, r707, r708, r710, r712, r713)) based on Jim's idea.
  2. Back out the changes in LAPACK to go back to using the old reflector (xlarfg) everywhere (e.g. in xGEQRF, xGEQR2, etc.).
  3. Keep the change that saves flops when u has lots of trailing zeros, since this is independently a good idea.
  4. Change the name from xLARFP to xLARFGP and introduce new QR routines (xGEQRFP and xGEQR2P) that uses xlarfp to guarantee a new positive diagonal of R, for who needs it.
  5. This requires to adapt the test suites
  6. Note: only the QR routines have an xGEQRFP. While it seems that quite a few users care for QR factorization with nonnegative diagonal R. Noone seemed to care for this feature on LQ, QL or RQ. So we have left out these routines.
  7. Steps 2 to 6 were done during r755 by Julie.
christoph-conrads commented 8 months ago

Thank you for providing this reference. Apparently the changes to xLARFGP are on target whereas the changes to xORBDB5 and xORBDB6 from #930 are just working around the problem.

2) this new routine is intrinsically more sensitive to over/underflow, this is now well understood, details are skipped here.

After reading the linked messages, reading §2 in LAWN 203 where xLARFGP is introduced, and reviewing my test data, I claim that the problem is caused by the current implementation handling only some of the possible under- and overflows.

For example, in the first message in this issue the expression ONE / ALPHA is observed to overflow. The suggested fix SLASCL( CFROM=ALPHA, CTO=ONE, X ) does not work because ALPHA is subnormal in my test case; the computed vector entries are off by factor two. The problem could be avoided by computing SLASCL( CFROM=XNORM, CTO=-ALPHA/XNORM, X ) and SLASCL( CFROM=ALPHA, CTO=ONE, X ), respectively, depending on which branch is taken in line 192.

To resolve this issue, I will go through the xLARFGP code line by line and check if one can trigger over- or underflows. Luckily dimension-two inputs suffice to trigger these problems.

On the upside I previously claimed that SLASCL can replace calls to SSCAL only if INCX = 1. This is not true:

christoph-conrads commented 8 months ago

Computations with Householder reflectors by xLARFGP are only conditionally backward stable. The problem is not the computation of the reflector itself but the entries large in modulus it can contain. Those large entries in turn cause the loss of significant digits when the reflector is applied (e.g., with xLARF). Consequently, the problems with xLARFGP are not necessarily evident by looking only at the xLARFGP output.

The text is structured as follows:

tl;Dr Householder reflectors computed by xLARFGP can contain values with very large modulus when x(1) > 0 and x(1) ≈ ‖x‖. This can cause overflows during the reflector computation or later a loss of accuracy when the reflector is used by.

Notation

Given a vector x, xLARFGP computes a vector v subject to v(1) = 1 and a scalar τ such that v is an elementary Householder reflector with I - τ v v^* x = β e₁, where β = ‖x‖. In contrast to xLARFG (note the missing P in the subroutine name) β is constrained to be nonnegative.

A 2x2 Example

Consider the following matrix in single precision:

A = [10/11 ε⁴, 1]
    [12/10 ε⁵, 0]

This example was derived from the failing test that led to the creation of this issue. The constants 10/11 and 12/10 are tuned values from the test data; their most important properties are

The QR decomposition computed by SGEQRF with SGEQR2 patched to use SLARFGP is as follows:

+1.8358945e-28 -1.0000000e+00
-1.3981013e+07 +1.4310353e-07

Note the very large value in modulus in the first column; this is the v(2) entry of the first reflector. The Householder matrix assembled by SORGQR from this data looks as follows:

+1.0000000e+00 -1.4310353e-07
+1.4310353e-07 +1.0007324e+00

There are two things of note in this matrix that should be orthogonal:

The orthogonal matrix computed by SGEQRF, SORGQR, and SLARFGP is inaccurate.

An Explanation

The computation of Householder matrices is in general known to be backward stable. This section will discuss which assumptions made in the numerical linear algebra literature are not met by a Householder reflector computed with a nonnegative β. Specifically, I will refer to the proofs in "Accuracy and Stability of Numerical Algorithms", 2nd Edition, 2002, by Nicholas J. Higham (ASNA for short) in §19.3 "Error Analysis of Householder Computations".

There exist three points of divergence when ε must be nonnegative.

  1. β is allowed to be negative whereas xLARFGP constrains it to be nonnegative.
  2. The norm of the reflector v is assumed to be √2 (see the paragraph after Lemma 19.1) whereas the LAPACK implementation constructs a reflector with v(1) = 1.
  3. The reflector v is applied as (√τ v)(√τ v)^* whereas SLARF computes y - τ (v (v^* y)).

The norm of the reflector is significant in Lemma 19.2; this lemma proves a small backward error when (I - τ v v^*) y is computed for an arbitrary vector y.

Let us first consider the case of xLARFG:

  1. β can have any sign.
  2. The norm of the reflector is at least one and at most √2. This follows from the choice of sign. With this norm the assumptions of Lemma 19.2 can be assumed to be fulfilled.
  3. When β can have any sign, then 1 ≤ τ ≤ 2. Thus, thus the computation of y - τ (v (v^* y)) by SLARF should not deliver results that differ significantly from the computation y - (√τ v)(√τ v)^* y in ASNA.

In summary, there is a close match between the assumptions in ASNA and the actual LAPACK implementation.

Next, we examine the behavior with xLARFPG when x(1) is positive (if x(1) is negative, then it behaves identically to xLARFG and is numerically backward stable):

  1. β is constrained to be nonnegative.
  2. The norm of the reflector v is nearly unconstrained because x(1) - ‖x‖ can be very small. Thus the result of v(i) = x(i) / (x(1) - ‖x‖) can be very large in modulus. This implies large intermediate values in computations and a backward error that is much larger than the one calculated in ASNA Lemma 19.2.
  3. SLARF computes x - τ (v (v^* x)) and the modulus τ is nearly unconstrained again.

In summary, the proofs concerning the backward stability of Householder matrix-related calculations in ASNA cannot be used when reflectors are computed by xLARFGP and x(1) > 0 and x(1) ≈ ‖x‖.

Suggested Fixes

In general, I do not see how to fix xLARFGP. Possible work-arounds are:

  1. Always use the safe sign choice and hide a sign bit for τ somewhere in the data as suggested by Pat Quillen.
  2. Avoid critical input when the call sites are controlled by the programmer.

Appendix

The Python program below computes the QR decomposition of a 2x2 matrix and explicitly forms the Householder matrix Afterwards. The purpose of the program is to be able to play with the values in the first matrix column after replacing the call to SLARFG with a call to SLARFGP in SGEQR2.

Usage: python3 sgeqr-demo.py or python3 sgeqr-demo.py /path/to/liblapack.so.

#!/usr/bin/python3

import array
import ctypes
import ctypes.util
from ctypes import byref, c_char, c_float, c_int32, c_void_p, POINTER
import math
import sys

def main():
    if len(sys.argv) not in [1, 2]:
        return "usage: python3 %s [path to LAPACK library]" % (sys.argv[0], )

    if len(sys.argv) >= 2:
        lapack_library = sys.argv[1]
    else:
        lapack_library = ctypes.util.find_library("lapack")

    print("name of LAPACK library:", lapack_library)

    lapack = ctypes.CDLL(lapack_library, use_errno=True)

    # an alias for the _F_ortran integer type
    f_int = ctypes.c_int32
    lapack.sgeqrf_.restype = None
    lapack.sgeqrf_.argtypes = [
        POINTER(f_int),
        POINTER(f_int),
        # A
        POINTER(c_float),
        POINTER(f_int),
        POINTER(c_float),
        POINTER(c_float),
        POINTER(f_int),
        POINTER(f_int),
    ]

    lapack.sorgqr_.restype = None
    lapack.sorgqr_.argtypes = [
        POINTER(f_int),
        POINTER(f_int),
        POINTER(f_int),
        POINTER(c_float),
        POINTER(f_int),
        POINTER(c_float),
        POINTER(c_float),
        POINTER(f_int),
        POINTER(f_int),
    ]

    eps = 2**-23
    nan = float("NaN")

    m = 2
    n = 2
    x0 = 10/11 * eps
    x1 = 12/10 * eps * x0

    assert abs(x1 / x0) > eps

    a = array.array("f", [x0, x1, -1, 0])
    lda = m

    assert len(a) == m * n

    tau = array.array("f", [nan] * min(m, n))
    work = array.array("f", [nan] * 256)
    info = f_int(0)

    intref = lambda n: byref(f_int(n))
    f32ref = lambda array: (c_float * len(array)).from_buffer(array)
    lapack.sgeqrf_(
        intref(m),
        intref(n),
        f32ref(a),
        intref(lda),
        f32ref(tau),
        f32ref(work),
        intref(len(work)),
        byref(info),
    )

    assert info.value == 0

    print("tau +%8.2e %+8.2e" % (tau[0], tau[1]))
    print()
    print("a %+13.7e %+13.7e" % (a[0], a[2]))
    print("a %+13.7e %+13.7e" % (a[1], a[3]))

    lapack.sorgqr_(
        intref(m),
        intref(n),
        intref(n),
        f32ref(a),
        intref(lda),
        f32ref(tau),
        f32ref(work),
        intref(len(work)),
        byref(info),
    )

    assert info.value == 0

    print()
    print("q %+13.7e %+13.7e" % (a[0], a[2]))
    print("q %+13.7e %+13.7e" % (a[1], a[3]))
    print()
    print("delta = %+8.2e\n" % ((a[3] - 1) / eps,))

if __name__ == "__main__":
    sys.exit(main())
langou commented 8 months ago

Thanks for all this work Christoph, and taking the time to explain.

Three follow-up question/comment's.

(1) Would another representation of the Householder reflector be better? So right now LAPACK decides that the Householder vector is such that the top element is always a 1, but other choices are possible. See for example, The Computation of Elementary Unit Matrix by Rich Lehoucq, ACM Transactions on Mathematical Software, Volume 22, Issue 4, pp 393–400, https://doi.org/10.1145/235815.235817, where Rich Lehoucq explains four previous representations (1) Wilkinson's approach, which is also used in EISPACK, (2) LINPACK approach, (3) NAG approach, (4) LAPACK approach. Or maybe another approach would help. I agree that this does not seem to help, but since I am not sure I wanted to ask.

(2) I would be in favor of spending some time to implement Always use the safe sign choice and hiding a sign bit for τ somewhere in the data as suggested by Pat Quillen. How does that work for complex numbers though? It seems that for complex numbers we need more than a bit. Shall we just store an extra vectors of scalars (all of modulus 1, so +/-1 in real and e^(i.theta) in complex) along τ then?

(3) To some extent these signs are already in the output of GEQRF (or LARFG) . . . they are on the diagonal of R. So we could return the signs on the diagonal of R, being understood that of course the diagonal entries of R are real nonnegative. And we are only storing the signs on the diagonal of R to save space. This sounds like a bad idea. And I would more be in favor to store an extra array for these signs, and then have the diagonal of R real.

We can add this on the wish list.

christoph-conrads commented 8 months ago

(1) Frankly I do not want to look at alternative representations now and finish #406 before involving myself in something else. Besides the current representation has very nice properties (the first reflector element is known to be one which allows for compact QR/QL/... representations, the reflector norm is close to one, and 1 ≤ τ ≤ 2).

(2) The issue when computing a reflector Hx = β e₁ is not computing a real β but the choice of the sign. (In fact, LAPACK computed real β until version 3.1.1 according to LAWN 203.) If it was not for the fixed computation of I - τ v v^*, the issue would be trivial to fix, e.g., by

~edit Now I see what you mean. τ is complex in the complex case. I need to think about this. /edit~

In the complex case, 1 ≤ Re(τ) ≤ 2 and only the sign bit of the real part of τ would need to be modified.

[The preceding paragraph mentions matrix multiplications from the left or the right. The idea is that x is the first column of a matrix whose columns are about to be orthonormalized.]

In the current code, τ never has its sign bit set and one could exploit this unused bit of information to request the computation of |τ| v v^* - I in the SLARF* subroutines. This would work as follows:

Note that this suggestion requires a float representation where +0 and -0 can be distinguished. Checking for τ < 0 is insufficient (think of x = [-1, 0]).

The advantages of this approach are as follows:

Disadvantages:

(3) The most important call site for xLARFGP are the functions for the CS decomposition and they do not compute QR decompositions.

christoph-conrads commented 8 months ago
  • If the xLARF* implementations were not updated to handle τ with a sign bit set, then they will compute garbage. [...]

Two scenarios come to mind how this might happen:

The change would only affect users of xLARFGP though.

christoph-conrads commented 7 months ago

A 3x3 Example with Rapid Divergence

The stability proofs for the computations of QR decomposition by means of Householder reflections do not apply when the sign choice by slarfgP is used. In this post, I demonstrate the results with specially a crafted matrix A and a unit-norm vector x such that computations involving the orthogonal factor of the QR decomposition diverge much more rapidly during the computations of matrix-vector products Qᵏ x in finite-precision arithmetic if slarfgP is employed instead of slarfg with its safe sign choice, where Q is the orthogonal factor of a QR decomposition based on Householder reflections of A. The effect is present for both the factored representation and the explicitly formed representation of the orthogonal factor.

Q will denote the orthogonal factor of a QR decomposition in its factorized form whereas P will denote the same factor after a call to xORGQR. The index + denotes matrices computed by slarfgP. The Euclidean norm is used throughout this post.

Reminder: in real arithmetic, it holds that

Maximizing Inaccuracy with slarfgP

Based on my previous posts, inputs x and A were devised such that Q₊ x is as inaccurate as possible. More concretely, each column a that is used for the computation of a Householder reflector must have the following form:

With a lot of trial-and-error, the following 3x3 matrix A and vector x was crafted (NB: crafted, not constructed. A lot of trial-and-error was involved.):

    ε      1            0
8/7·ε^2    ε            0
    ε^3    28/31·ε^2    0

Vector x:

0
-1/7
102 / 101

Growth of Vector Norm

When computing ‖Q^k x‖, the norm grows in every iteration when slarfgP is used (see table below) and the norm grows quickly beyond what is considered an acceptable rounding error. No such effect can be observed with slarfg and for this reason, there is no column for this method in the table below. Obviously, if Q₊ᵏ x grows strictly monotonously in norm, then x = Q₊²ᵏ x cannot hold. Furthermore, the growth in norm is still present though less pronounced after explicitly forming the orthogonal factor P₊.

It holds that |‖Q₊²ᵏ x‖ - 1| ≤ ‖x - Q₊²ᵏ x‖ implying that the result of repeated multiplications is quickly becoming inaccurate.

k ‖Q₊ᵏ x‖ - 1 ‖P₊ᵏ x‖ - 1
1
2
3
4 11ε
8 24ε 16ε
16 48ε 33ε
32 93ε 66ε
64 183ε 132ε

Divergence of Multiplication Result

In the next experiment ‖x - Qᵏ x‖, ‖x - Qᵏ x‖ and ‖x - 1/l P₊ᵏ x‖ are computed for even k, where l = ‖Q₊ᵏ x‖ or l = ‖P₊ᵏ x‖, respectively. The previous experiment showed that when slarfgP is employed, l will be different from one. This paragraph demonstrates that in addition to a growth in norm of the multiplication result, the computed result is increasingly inaccurate when slarfgP is used (even after normalization). The effect is again less pronounced when the orthogonal factor P₊ is explicitly formed.

The table below does not contain a column for results involving P (the explicitly formed orthogonal factor computed by slarfg) because the results with P and Q are indistinguishable at the accuracy level of the table.

k ‖x - Qᵏ x‖ ‖x - 1/l Q₊ᵏ x‖ ‖x - 1/l P₊ᵏ x‖
2 0.3ε 1.0ε 0.8ε
4 0.6ε 2.0ε 1.4ε
8 1.3ε 3.6ε 2.8ε
16 2.6ε 7.3ε 5.4ε
32 5.1ε 15.1ε 10.9ε
64 10.2ε 30.6ε 21.6ε

Appendix: QR decomposition with slarfgP

diff --git a/SRC/sgeqr2.f b/SRC/sgeqr2.f
index 3a78733b7..928fa90b1 100644
--- a/SRC/sgeqr2.f
+++ b/SRC/sgeqr2.f
@@ -150,7 +150,7 @@
       REAL               AII
 *     ..
 *     .. External Subroutines ..
-      EXTERNAL           SLARF, SLARFG, XERBLA
+      EXTERNAL           SLARF, SLARFG, SLARFGP, XERBLA
 *     ..
 *     .. Intrinsic Functions ..
       INTRINSIC          MAX, MIN
@@ -178,7 +178,7 @@
 *
 *        Generate elementary reflector H(i) to annihilate A(i+1:m,i)
 *
-         CALL SLARFG( M-I+1, A( I, I ), A( MIN( I+1, M ), I ), 1,
+         CALL SLARFGP( M-I+1, A( I, I ), A( MIN( I+1, M ), I ), 1,
      $                TAU( I ) )
          IF( I.LT.N ) THEN
 *

Appendix: Python Code

Run the code with python3 /path/to/lapack/library <k>.

#!/usr/bin/python3

import array
import copy
import ctypes
import ctypes.util
from ctypes import byref, c_char, c_double, c_float, c_int32, c_void_p, POINTER
from ctypes import c_size_t
import math
import sys

def print_matrix(m: int, n: int, a):
    lda = m
    for i in range(m):
        for j in range(n):
            index = j * lda + i
            print("%+15.9e" % (a[index], ), end=" " if j + 1 < n else "")
        print()

def main():
    if len(sys.argv) not in [1, 2, 3]:
        return "usage: python3 %s [path to LAPACK library] [k]" % (
            sys.argv[0], )

    if len(sys.argv) >= 2:
        lapack_library = sys.argv[1]
    else:
        lapack_library = "/tmp/tmp.yVyeY13Yna/lib/liblapack.so.3"

    k = int(sys.argv[2]) if len(sys.argv) >= 3 else 4

    if k < 0:
        return "the iteration count k must be positive, got %d" % (k,)

    print("name of LAPACK library:", lapack_library)
    print("iteration count:", k)
    print()

    lapack = ctypes.CDLL(lapack_library, use_errno=True)

    # an alias for the _F_ortran integer type
    f_int = ctypes.c_int32

    lapack.sgemv_.restype = None
    lapack.sgemv_.argtypes = [
        POINTER(c_char),
        POINTER(f_int),
        POINTER(f_int),
        # alpha
        POINTER(c_float),
        # A
        POINTER(c_float),
        POINTER(f_int),
        # x
        POINTER(c_float),
        POINTER(f_int),
        # beta
        POINTER(c_float),
        # y
        POINTER(c_float),
        POINTER(f_int),
        POINTER(c_size_t),
    ]

    lapack.sgeqrf_.restype = None
    lapack.sgeqrf_.argtypes = [
        POINTER(f_int),
        POINTER(f_int),
        # A
        POINTER(c_float),
        POINTER(f_int),
        POINTER(c_float),
        POINTER(c_float),
        POINTER(f_int),
        POINTER(f_int),
    ]

    lapack.snrm2_.restype = ctypes.c_float
    lapack.snrm2_.argtypes = [POINTER(f_int), POINTER(c_float), POINTER(f_int)]

    lapack.sorgqr_.restype = None
    lapack.sorgqr_.argtypes = [
        POINTER(f_int),
        POINTER(f_int),
        POINTER(f_int),
        POINTER(c_float),
        POINTER(f_int),
        POINTER(c_float),
        POINTER(c_float),
        POINTER(f_int),
        POINTER(f_int),
    ]

    lapack.sormqr_.restype = None
    lapack.sormqr_.argtypes = [
        POINTER(c_char),
        POINTER(c_char),
        POINTER(f_int),
        POINTER(f_int),
        POINTER(f_int),
        # A
        POINTER(c_float),
        POINTER(f_int),
        # tau
        POINTER(c_float),
        # C
        POINTER(c_float),
        POINTER(f_int),
        # work
        POINTER(c_float),
        POINTER(f_int),
        POINTER(f_int),
        POINTER(c_size_t),
        POINTER(c_size_t),
    ]

    lapack.dgeqrf_.restype = None
    lapack.dgeqrf_.argtypes = [
        POINTER(f_int),
        POINTER(f_int),
        # A
        POINTER(c_double),
        POINTER(f_int),
        POINTER(c_double),
        POINTER(c_double),
        POINTER(f_int),
        POINTER(f_int),
    ]

    lapack.dorgqr_.restype = None
    lapack.dorgqr_.argtypes = [
        POINTER(f_int),
        POINTER(f_int),
        POINTER(f_int),
        POINTER(c_double),
        POINTER(f_int),
        POINTER(c_double),
        POINTER(c_double),
        POINTER(f_int),
        POINTER(f_int),
    ]

    eps32 = 2**-23
    nan = float("NaN")

    m = 3
    n = m

    lda = m
    a = array.array("f", [0] * (lda * n))

    a[0 * lda + 0] = eps32
    a[0 * lda + 1] = 8 / 7 * eps32 * a[0]
    a[0 * lda + 2] = eps32**2 * a[0]

    a_10 = 1
    a_11 = eps32 * a_10
    a_12 = 28 / 31 * eps32 * a_11
    a[1 * lda + 0] = a_10
    a[1 * lda + 1] = a_11
    a[1 * lda + 2] = a_12

    assert len(a) == lda * n

    print("A")
    print_matrix(lda, n, a)
    print()

    tau = array.array("f", [nan] * min(m, n))
    work = array.array("f", [nan] * 256)
    info = f_int(0)

    intref = lambda n: byref(f_int(n))
    f32ref = lambda array: (c_float * len(array)).from_buffer(array)
    lapack.sgeqrf_(
        intref(m),
        intref(n),
        f32ref(a),
        intref(lda),
        f32ref(tau),
        f32ref(work),
        intref(len(work)),
        byref(info),
    )

    assert info.value == 0

    print("TAU")
    for (i, t) in enumerate(tau):
        print("%2d %+15.9e" % (i, t))
    print()

    print("A after xGEQR")
    print_matrix(lda, n, a)

    def multiply(a, lda, x):
        y = array.array("f", [nan] * n)
        alpha = c_float(1)
        beta = c_float(0)
        lapack.sgemv_(
            b'N',
            intref(m),
            intref(n),
            byref(c_float(1)),
            f32ref(a),
            intref(lda),
            f32ref(x),
            intref(1),
            byref(beta),
            f32ref(y),
            intref(1),
            byref(c_size_t(1)),
        )
        return y

    def norm(xs):
        return lapack.snrm2_(intref(len(xs)), f32ref(xs), intref(1))

    def normalize(xs):
        ys = copy.deepcopy(xs)
        norm_x = norm(xs)
        for i in range(len(xs)):
            ys[i] = xs[i] / norm_x
        return ys

    def compute_delta(xs):
        assert len(xs) == len(x_0)

        return array.array("f", [xs[i] - x_0[i] for i in range(len(x_0))])

    x_0 = normalize(array.array("f", [0, -1 / 7, 102 / 101]))
    x = copy.deepcopy(x_0)

    assert len(x) == n

    for _ in range(k):
        lapack.sormqr_(
            b'L',
            b'N',
            intref(m),
            intref(1),
            intref(m),
            f32ref(a),
            intref(lda),
            f32ref(tau),
            f32ref(x),
            intref(n),
            f32ref(work),
            intref(len(work)),
            byref(info),
            byref(c_size_t(1)),
            byref(c_size_t(1)),
        )

        assert info.value == 0

    x_normed = normalize(x)

    p = copy.deepcopy(a)
    ldp = lda

    lapack.sorgqr_(
        intref(m),
        intref(n),
        intref(n),
        f32ref(p),
        intref(ldp),
        f32ref(tau),
        f32ref(work),
        intref(len(work)),
        byref(info),
    )

    x_gemv = copy.deepcopy(x_0)

    for _ in range(k):
        x_gemv = multiply(p, ldp, x_gemv)

    x_normed_v = normalize(x_gemv)

    print()
    print("        x", " ".join(["%+15.9e" % (x, ) for x in x_0]))
    print("    Q^%d x" % (k, ), " ".join(["%+15.9e" % (x, ) for x in x]))
    print("1/l Q^%d x" % (k, ),
          " ".join(["%+15.9e" % (x, ) for x in x_normed]))
    print("    P^%d x" % (k, ), " ".join(["%+15.9e" % (x, ) for x in x]))
    print("1/l P^%d x" % (k, ),
          " ".join(["%+15.9e" % (x, ) for x in x_normed_v]))
    print()
    print("norm(Q^%d x) = 1 + %.2f eps" % (k, (norm(x) - 1) / eps32))
    print("norm(P^%d x) = 1 + %.2f eps" % (k, (norm(x_gemv) - 1) / eps32))
    print()

    if k % 2 == 0:
        delta = compute_delta(x)
        delta_normed = compute_delta(x_normed)
        delta_gemv = compute_delta(x_gemv)
        delta_normed_v = compute_delta(x_normed_v)

        print("norm(x - Q^%d x)     = %9.3e eps" % (k, norm(delta) / eps32))
        print("norm(x - 1/l Q^%d x) = %9.3e eps" %
              (k, norm(delta_normed) / eps32))
        print("norm(x - P^%d x)     = %9.3e eps" %
              (k, norm(delta_gemv) / eps32))
        print("norm(x - 1/l P^%d x) = %9.3e eps" %
              (k, norm(delta_normed_v) / eps32))

    assert info.value == 0

if __name__ == "__main__":
    sys.exit(main())