Flaky CI tests with Scipy 1.11.0 on Windows Python 3.11

jakelishman commented 1 year ago

Environment

Qiskit Terra version: main
Python version: Python 3.11
Operating system: Windows

What is happening?

CI is flaky on Windows / Python 3.11 since the release of Scipy 1.11.

How can we reproduce the issue?

Example CI traceback:

test.python.providers.test_backend_v2.TestBackendV2.test_5q_ghz_22_ecr_level_3_bidirectional_False
--------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):

      File "D:\a\1\s\test-job\Lib\site-packages\ddt.py", line 220, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\test\python\providers\test_backend_v2.py", line 113, in test_5q_ghz
    tqc = transpile(qc, backend, optimization_level=opt_level)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\compiler\transpiler.py", line 375, in transpile
    _serial_transpile_circuit(

      File "D:\a\1\s\qiskit\compiler\transpiler.py", line 457, in _serial_transpile_circuit
    result = pass_manager.run(circuit, callback=callback, output_name=output_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 428, in run
    return super().run(circuits, output_name, callback)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 447, in wrapper
    return meth(*meth_args, **meth_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 154, in run
    return super().run(
           ^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passmanager.py", line 225, in run
    out_program = self._run_single_circuit(in_programs[0], callback, **metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 173, in _run_single_circuit
    out_program = pass_runner.run(input_program, callback=callback, **metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\runningpassmanager.py", line 226, in wrapper
    return meth(*meth_args, **meth_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\runningpassmanager.py", line 148, in run
    return super().run(
           ^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passrunner.py", line 235, in run
    passmanager_ir = self._run_pass_generic(
                     ^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passrunner.py", line 190, in _run_pass_generic
    passmanager_ir = self._run_pass_generic(
                     ^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passrunner.py", line 151, in _run_pass_generic
    passmanager_ir = self._run_base_pass(
                     ^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\runningpassmanager.py", line 177, in _run_base_pass
    new_dag = pass_.run(passmanager_ir)
              ^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passes\optimization\consolidate_blocks.py", line 123, in run
    or self.decomposer.num_basis_gates(unitary) < basis_count
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\quantum_info\synthesis\two_qubit_decompose.py", line 1414, in num_basis_gates
    a, b, c = weyl_coordinates(unitary)[:]
              ^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\quantum_info\synthesis\weyl.py", line 65, in weyl_coordinates
    D = la.eigvals(Up.T @ Up)
        ^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\test-job\Lib\site-packages\scipy\linalg\_decomp.py", line 895, in eigvals
    return eig(a, b=b, left=0, right=0, overwrite_a=overwrite_a,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\test-job\Lib\site-packages\scipy\linalg\_decomp.py", line 249, in eig
    _check_info(info, 'eig algorithm (geev)',

      File "D:\a\1\s\test-job\Lib\site-packages\scipy\linalg\_decomp.py", line 1371, in _check_info
    raise LinAlgError(("%s " + positive) % (driver, info,))

    numpy.linalg.LinAlgError: eig algorithm (geev) did not converge (only eigenvalues with order >= 2 have converged)

What should happen?

Reliable CI.

Any suggestions?

I suspect / hope that this is just something like Scipy's internal ARPACK build being done with a new compiler that's tweaked things a little bit. Eigensystem routines are never 100% reliable, and we call them a lot, so it's pretty likely that it just so happens that one matrix in our test suite happens to be a bit unreliable.

In the longer term, we can potentially attempt to stabilise the routines by the insertion of some very slight amounts of noise if the initial decomposition fails, but we'd first need to work out if the Scipy 1.11 is actually significantly less stable than what we've already got; it could just be an unlucky isolated failure.

levbishop commented 1 year ago

Are all the failures in weyl_coordinates? If you can pull out a few failing examples I can try to dive in

jakelishman commented 1 year ago

Yeah, the traceback I linked is the only failing example I've seen, and it only occurs on Windows.

levbishop commented 1 year ago

in particular if it's only weyl_coordinates that fails we could go back to the original algorithm from TwoQubitWeylDecomposition.__new__() that uses eigh and should therefore be much more stable, if possibly a little slower

Or you could dust off your slower but non-randomized algorithm from a couple years ago :-) As I recall it was mainly slow because it was doing a lot of python-level array indexing and manipulation and therefore slow because less numpy-accelerated. I bet a rust implementation of that algorithm would be faster and more robust than both TwoQubitWeylDecomposition randomized hermitian eval/evec solver and the weyl_coordinates non-hermitian eval-only algorithm. :-)

jakelishman commented 1 year ago

Thanks Lev. Yeah, that's a good point - we could definitely just do the eigensystem bits using the Scipy dynamically linked BLAS, then pass the array views down to Rust to sort out the orthonormalisation step. I think for now I'll pin scipy to get CI rolling again, and then I'll try and figure things out better when I've got a bit more of a moment - I've got a bunch of dynamics-circuits-related feature stuff that needs to take priority first.

raynelfss commented 1 year ago

I was able to replicate this issue on my Windows machine by running transpile on a very big circuit (1024 qubits, 128 depth) with optimization level 3. I kept running into the same error occasionally on the following versions of scipy: 1.11.0, 1.10.1, 1.10.0, 1.9.3, and 1.9.2.

 numpy.linalg.LinAlgError: eig algorithm (geev) did not converge (only eigenvalues with order >= 2 have converged)

Side note: It should be noted that there were occasions where this issue didn't happen and that was whenever the ConsolidateBlocks ran a certain decomposer (num_basis_gates):

This happens whenever I installed a different version of scipy without restarting the kernel and it slows down the pass considerably. After restarting the environment/kernel, the error would still happen.

jakelishman commented 1 year ago

Thanks Ray, that's (of sorts) good news for the Terra 0.25 release, because it means that the problem was pre-existing, so we're fine to ship with our requirements allowing Scipy 1.11. I can safely remove this from the 0.25 milestone, and I might be able to make that change that Lev's suggested above to hopefully fortify our eigensystem routines a little bit. It's a little easier now that we've built up a lot more infrastructure in Rust around accelerated compiled routines.

raynelfss commented 1 year ago

I also found this error happening on Windows with Python 3.10 but only with any version with the 1.11 label. It appears that this error happens whenever a very large circuit is processed, but it is particularly more sensitive with anything in the 1.11 version of Scipy. I was able to transpile circuits with 512 qubits and a depth of 64 with 1.10.1, 1.10.0, and 1.9.3 without any errors, but not with 1.11.1 and 1.11.0.

For now, I can only recommend we run CI on Windows machines with Python<=3.10 and with scipy<=1.10.1.

Qiskit / qiskit