Open jakelishman opened 1 year ago
Are all the failures in weyl_coordinates?
If you can pull out a few failing examples I can try to dive in
Yeah, the traceback I linked is the only failing example I've seen, and it only occurs on Windows.
in particular if it's only weyl_coordinates that fails we could go back to the original algorithm from TwoQubitWeylDecomposition.__new__() that uses eigh and should therefore be much more stable, if possibly a little slower
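For context, here is a minimal sketch of the eigh-based idea mentioned above. It is not Qiskit's actual code, and the function name and parameters are hypothetical. It assumes the input is a symmetric unitary matrix; the real and imaginary parts of such a matrix are commuting real symmetric matrices, so a random real combination of them can be handed to the Hermitian solver eigh, and the resulting real orthogonal basis also diagonalises the original matrix.

```python
import numpy as np


def symmetric_unitary_eig(m2, seed=None, max_tries=100, atol=1e-12):
    """Diagonalise a symmetric unitary matrix as m2 = p @ diag(d) @ p.T.

    Hypothetical helper for illustration only. `p` is real orthogonal and
    `d` holds the (complex, unit-modulus) eigenvalues.
    """
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        # A generic real combination of the commuting real and imaginary
        # parts shares their common eigenvectors.
        a, b = rng.normal(size=2)
        mixed = a * m2.real + b * m2.imag
        # eigh is the Hermitian solver, so it returns an orthonormal basis.
        _, p = np.linalg.eigh(mixed)
        d = np.diag(p.T @ m2 @ p)
        # An unlucky combination can introduce accidental degeneracies, so
        # check the reconstruction and retry with new coefficients if needed.
        if np.allclose(p @ np.diag(d) @ p.T, m2, rtol=0, atol=atol):
            return d, p
    raise np.linalg.LinAlgError("no real eigenbasis found for the input")


# Usage: build a symmetric unitary from a random real orthogonal basis.
rng = np.random.default_rng(1234)
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
theta = rng.uniform(-np.pi, np.pi, size=4)
m2 = q @ np.diag(np.exp(1j * theta)) @ q.T
d, p = symmetric_unitary_eig(m2, seed=0)
```

The retry loop is the "randomized" part: each attempt costs one eigh call, which is the possible slowdown mentioned above.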
Or you could dust off your slower but non-randomized algorithm from a couple years ago :-) As I recall it was mainly slow because it was doing a lot of Python-level array indexing and manipulation, so less of the work was numpy-accelerated. I bet a Rust implementation of that algorithm would be faster and more robust than both the TwoQubitWeylDecomposition randomized Hermitian eval/evec solver and the weyl_coordinates non-Hermitian eval-only algorithm. :-)
Thanks Lev. Yeah, that's a good point - we could definitely just do the eigensystem bits using the Scipy dynamically linked BLAS, then pass the array views down to Rust to sort out the orthonormalisation step. I think for now I'll pin scipy to get CI rolling again, and then I'll try and figure things out better when I've got a bit more of a moment - I've got a bunch of dynamics-circuits-related feature stuff that needs to take priority first.
I was able to replicate this issue on my Windows machine by running transpile on a very big circuit (1024 qubits, 128 depth) with optimization level 3. I kept running into the same error occasionally on the following versions of scipy: 1.11.0, 1.10.1, 1.10.0, 1.9.3, and 1.9.2.
numpy.linalg.LinAlgError: eig algorithm (geev) did not converge (only eigenvalues with order >= 2 have converged)
Side note: there were occasions where this issue didn't happen, namely whenever ConsolidateBlocks ran a certain decomposer (num_basis_gates):
This happened whenever I installed a different version of scipy without restarting the kernel, and it slowed down the pass considerably. After restarting the environment/kernel, the error would still happen.
Thanks Ray, that's (of sorts) good news for the Terra 0.25 release, because it means that the problem was pre-existing, so we're fine to ship with our requirements allowing Scipy 1.11. I can safely remove this from the 0.25 milestone, and I might be able to make that change that Lev's suggested above to hopefully fortify our eigensystem routines a little bit. It's a little easier now that we've built up a lot more infrastructure in Rust around accelerated compiled routines.
I also found this error happening on Windows with Python 3.10, but only with the 1.11 releases of Scipy. It appears that the error happens whenever a very large circuit is processed, and it is noticeably more sensitive with any 1.11 version of Scipy. I was able to transpile circuits with 512 qubits and a depth of 64 with 1.10.1, 1.10.0, and 1.9.3 without any errors, but not with 1.11.1 and 1.11.0.
For now, I can only recommend we run CI on Windows machines with Python<=3.10 and with scipy<=1.10.1.
Environment
What is happening?
CI is flaky on Windows / Python 3.11 since the release of Scipy 1.11.
How can we reproduce the issue?
Example CI traceback:
What should happen?
Reliable CI.
Any suggestions?
I suspect / hope that this is just something like Scipy's internal LAPACK build being done with a new compiler that's tweaked things a little bit. Eigensystem routines are never 100% reliable, and we call them a lot, so it's pretty likely that it just so happens that one matrix in our test suite happens to be a bit unreliable.
In the longer term, we can potentially attempt to stabilise the routines by inserting some very slight amounts of noise if the initial decomposition fails, but we'd first need to work out if Scipy 1.11 is actually significantly less stable than what we've already got; it could just be an unlucky isolated failure.
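A minimal sketch of the "retry with slight noise" idea, assuming numpy only. The function name, the retry count, and the noise scale are all hypothetical choices for illustration, not anything taken from Qiskit. The solver is passed in as an argument so that a convergence failure can be simulated.

```python
import numpy as np


def eigvals_with_retry(matrix, solver=np.linalg.eigvals, max_tries=3,
                       scale=1e-14, seed=None):
    """Return eigenvalues, perturbing the input slightly if the solver fails.

    Hypothetical helper for illustration only. The first attempt uses the
    matrix unchanged; later attempts add complex Gaussian noise of size
    `scale` to every entry.
    """
    matrix = np.asarray(matrix, dtype=complex)
    rng = np.random.default_rng(seed)
    last_error = None
    for attempt in range(max_tries):
        candidate = matrix
        if attempt > 0:
            noise = rng.normal(size=matrix.shape) + 1j * rng.normal(size=matrix.shape)
            candidate = matrix + scale * noise
        try:
            return solver(candidate)
        except np.linalg.LinAlgError as exc:
            # Remember the failure and try again with a fresh perturbation.
            last_error = exc
    raise last_error


# Usage with a fake solver that fails once, standing in for a geev
# non-convergence, and then defers to numpy.
calls = {"count": 0}


def flaky_solver(m):
    calls["count"] += 1
    if calls["count"] == 1:
        raise np.linalg.LinAlgError("eig algorithm (geev) did not converge")
    return np.linalg.eigvals(m)


values = eigvals_with_retry(np.diag([1.0, 2.0, 3.0]), solver=flaky_solver, seed=0)
```

The noise scale has to stay well below the tolerance of whatever consumes the eigenvalues, which is one reason to first measure how often the unperturbed call actually fails.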