Qiskit / qiskit

Qiskit is an open-source SDK for working with quantum computers at the level of extended quantum circuits, operators, and primitives.
https://www.ibm.com/quantum/qiskit
Apache License 2.0

Speedup sparse matrix construction in SparsePauliOp #8772

Closed chetmurthy closed 4 months ago

chetmurthy commented 1 year ago

What should we add?

Just putting this at the top: link to qrusty: https://github.com/chetmurthy/qrusty

When constructing sparse matrices from a SparsePauliOp, the current code constructs a sparse matrix from each BasePauli (which is a string of the form 'IIIXYZIZYIII') and then combines them in a linear combination with complex coefficients. This is inefficient and space-wasting on many levels:

  1. the BasePauli construction is Python code, and it's not the fastest.
  2. for large problems, we can have tens of thousands of BasePaulis, each of which takes tens of megabytes, so adding them up entails considerable construction of intermediate data structures that are nearly immediately discarded.

Concretely, I'm working with some people who have 24-qubit problems with >20k BasePaulis -- the construction problem alone takes effectively interminable time (heck, the 20-qubit problem runs longer than a weekend). Their largest problem (so far) has 92B nonzero complex entries, and another person I'm working with has 27-qubit problems.

There's another way to solve this problem, but let's review how the current solution works:

sumMatrix := empty
for each (basePauli,coeff) in sparsePauliOp:
    spmat := convert basePauli to sparse matrix
    sumMatrix += coeff * spmat

And that convert line can be written as:

  basemat := empty
  for each row in 0..2^#qubits:
    basemat = basemat <append> (row, col, value) for the single nonzero entry in this row of basePauli

Viewed this way, the "sumMatrix +=" (when the matrices are represented as sparse matrices) is really an insertion sort: each row of "basemat" is insertion-sorted into the corresponding row of "sumMatrix", and where a column already holds a nonzero value, the two values are added.

But we can swap the order of the iterations, yielding:

sumMatrix := empty
for each row in 0..2^#qubits:
  baseRow := empty
  for each (basePauli, coeff) in sparsePauliOp:
    baseRow = baseRow <append> (col, coeff * value) for the nonzero entry in this row of basePauli
  sort baseRow by col#, summing the values of columns that appear more than once
  add baseRow to sumMatrix

This code has a natural parallelized implementation by taking the "for each row" loop and running it in parallel.
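
For concreteness, here is a minimal, single-threaded Python sketch of this row-wise construction -- the function and helper names are hypothetical (this is not Qiskit or qrusty API), and a dict stands in for the sort-and-coalesce step:

import numpy as np
from scipy.sparse import csr_matrix

def pauli_masks(label):
    """Bit masks for a Pauli label like 'IIXYZ' (qubit 0 = rightmost char).
    x_mask marks X/Y positions (column bit-flips); z_mask marks Z/Y
    positions (sign flips); num_y counts Y's (a global factor of -i each)."""
    x_mask = z_mask = num_y = 0
    for k, ch in enumerate(reversed(label)):
        if ch in "XY":
            x_mask |= 1 << k
        if ch in "ZY":
            z_mask |= 1 << k
        num_y += ch == "Y"
    return x_mask, z_mask, num_y

def spo_to_csr(labels, coeffs, num_qubits):
    """Build the CSR matrix row by row: one (col, value) pair per Pauli
    term per row, coalescing duplicate columns as we go."""
    dim = 1 << num_qubits
    masks = [pauli_masks(label) for label in labels]
    data, indices, indptr = [], [], [0]
    for row in range(dim):
        entries = {}
        for (x_mask, z_mask, num_y), coeff in zip(masks, coeffs):
            col = row ^ x_mask
            sign = -1.0 if bin(row & z_mask).count("1") % 2 else 1.0
            val = coeff * sign * (-1j) ** num_y
            entries[col] = entries.get(col, 0) + val
        for col in sorted(entries):
            indices.append(col)
            data.append(entries[col])
        indptr.append(len(indices))
    return csr_matrix((np.array(data), indices, indptr), shape=(dim, dim))

The outer loop over rows is the part that gets parallelized in the Rust implementation.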

I have code that can do this in this repo: https://github.com/chetmurthy/qrusty

and specifically this function: https://github.com/chetmurthy/qrusty/blob/ea11318ff4c92d33f2141841b6e28eb4cd1df8b3/qrusty/src/accel.rs#L267 which constructs the three vectors required to build a CSR matrix.

I've built code around this with classes that match "BasePauli" and "SparsePauliOp", so that this Rust code can be used from Python outside of Qiskit. There's also a Python wrapper of the sprs sparse-matrix type, because the people I'm working with also find that Python SciPy is not parallelized for many operations (e.g. matrix-vector multiply, "ax+by", evaluating "cMx + dMx" for complex "c" and vector "x") that need to be parallelized when the matrix dimension reaches 2^20 and above.

There are several ways one might import the basic function of "fast construction of matrices from SparsePauliOp" into Qiskit:

  1. just pull in this function (link above) and write enough wrapper to use it from Qiskit

  2. refer to Qrusty's implementation (i.e., construct the qrusty SparsePauliOp, then call the function to construct a SpMat, then convert it to a SciPy CSR matrix) but leave Qrusty as-is, so that people who need to do more parallelized matrix operations can do so with Qrusty

  3. integrate Qrusty into Qiskit, slowly making qrusty's version of BasePauli and SparsePauliOp into the implementation of Qiskit's versions of those objects.

It seems clear that #3 is a lot of work, maybe even not desirable. But #1 has the problem that anybody who actually wants to work with a high number of qubits can't do much with the matrix once it's constructed, without converting it back into a form where Rust parallelism can be applied to it. B/c the matrix is so bloody big.

My belief is that #2 is the way to go, for this reason.

nonhermitian commented 1 year ago

While this is probably going to speed things up, it is likely not an ideal solution; you still have big matrices. Rather, it is easy to directly compute matrix elements given a row and column index and a list of Paulis. Thus you can do things like matrix-vector operations without constructing a matrix at all. Generally these are known as "matrix-free methods".

chetmurthy commented 1 year ago

Two thoughts:

  1. true enough, but you're suggesting a larger change in the way people write code in this domain, where this issue is just trying to accelerate the code as-is.
  2. In the case I mention, with 24 qubits and 20k Pauli strings, computing the value of any particular matrix cell means computing whether each of those 20k Pauli strings contributes a nonzero value: sure, you can do this matrix-free, but it's still much less efficient than computing the matrix once, I would think. Though I could be wrong.

nonhermitian commented 1 year ago

It depends really. But agreed that it is more work than modifying the current operators.

chetmurthy commented 1 year ago

I am curious how one might apply matrix-free methods to computing a sparse-matrix x dense-vector product. Musing on it, it seems like you want to compute each row sparsely, which means executing the per-row algorithm I described in the issue text. So you compute the 20k nonzero positions in the 2^24-length row, multiply by the value in the dense vector for each one, and add 'em all up. But in this instance, the 24-qubit matrix has ~5B nonzero entries -- so that means between 2^8 and 2^9 nonzero entries per row (~5B / 2^24), so it seems like that's a lot of work being done to avoid ... memoizing the row.

But OTOH, I do take your point, in the sense that, if by doing this you can avoid moving around massive amounts of memory, then maybe it pays for itself just in locality.

I think I should code it up and see! Is the above description of what happens on each row of the sparse-matrix x dense-vector product an accurate reflection of what you were imagining?
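
For concreteness, a minimal sketch of the per-row dot-product just described, with no coalescing (one random access into the dense vector per Pauli string); masks is assumed to be a list of (x_mask, z_mask, num_y) triples as produced by the hypothetical pauli_masks helper in the earlier sketch:

def row_dot(masks, coeffs, row, vec):
    """One row of sum_i coeff_i * P_i, dotted with the dense vector vec:
    each Pauli string contributes exactly one nonzero in this row."""
    acc = 0j
    for (x_mask, z_mask, num_y), coeff in zip(masks, coeffs):
        col = row ^ x_mask                                # flip bits under X/Y
        sign = -1.0 if bin(row & z_mask).count("1") % 2 else 1.0
        acc += coeff * sign * (-1j) ** num_y * vec[col]   # random access into vec
    return acc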

nonhermitian commented 1 year ago

So the basic idea follows Eq.4 from here: https://journals.aps.org/prxquantum/abstract/10.1103/PRXQuantum.2.040326

Namely, given a row and col expressed as integers, you can convert them to their bit-string values and use those values to index the elements of each 1Q operator in the operator string.

Now ideally you would not do this for I because those do not matter. In addition, one can usually break out of the element computation early if you hit a zero element, since the equation is a product.
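
As a minimal sketch of that element computation (the element function and OPS table here are hypothetical, not an existing Qiskit API):

import numpy as np

# 1Q operator matrices, indexed by label
OPS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def element(label, row, col):
    """Matrix element <row|P|col> of the operator string `label`, as a
    product of 1Q elements indexed by the bits of row and col
    (qubit 0 = rightmost character)."""
    val = 1 + 0j
    for k, ch in enumerate(reversed(label)):
        r_bit, c_bit = (row >> k) & 1, (col >> k) & 1
        if ch == "I":
            if r_bit != c_bit:
                return 0j   # identity contributes 1 only on the diagonal
            continue        # skip the multiply for I
        v = OPS[ch][r_bit, c_bit]
        if v == 0:
            return 0j       # early exit: the whole product is zero
        val *= v
    return val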

You can do it in parallel over the rows (or cols), which is commonly done. In your case, if you did a large sparse-matrix dense-vector computation (SpMV, as it is known) you would find that the benefits of parallel iteration diminish as the number of nonzero elements in the matrix (NNZ) grows. This is because such operations are memory-bandwidth limited, and the floating-point units of the CPU end up waiting on memory IO. The limit is basically the cache size of your processor.

Now ideally all of this would be done as part of a somewhat larger effort on SparseOperators that @mtreinish and I have been discussing recently. As you have seen, there are efficiency improvements to be found in the Qiskit operators, not just in terms of dropping them down to lower-level languages, but also just in terms of structure and algorithms.

ikkoham commented 1 year ago

In general, we would welcome the speedup with Rust as long as there are no API changes.

The bottleneck of the SparsePauliOp.to_matrix() method is the following line: https://github.com/Qiskit/qiskit-terra/blob/5c19a1f34935cc0a230878fc67539af59cc0ab2a/qiskit/quantum_info/operators/symplectic/base_pauli.py#L434 -- I think it is easiest and most acceptable to rustify this line.

SparsePauliOp's constructor is slow: I think we can rustify https://github.com/Qiskit/qiskit-terra/blob/5c19a1f34935cc0a230878fc67539af59cc0ab2a/qiskit/quantum_info/operators/symplectic/pauli_list.py#L170-L183.

Should I make everything Rust? I don't know whether that's possible, and isn't it too hard, since there are many methods? And we may want to support jax in the future (not definitive...)

chetmurthy commented 1 year ago

Hamamura-san,

  1. Indeed, I first rustified that line, and it was worth a speedup. But you still construct massive matrices for each Pauli string, only to "add them up" to get your matrix for the SparsePauliOp.

  2. The next step is to observe that the "add them up" is a kind of insertion sort (viewing each row of the matrix as a list of col/value pairs, adding two rows as vectors is like merging them), where "sum += mat_i" "inserts" the values of "mat_i" into "sum". So you can speed that up by turning the insertion sort into a mergesort in the obvious way (see the sketch after this list).

These are worth a good speedup, and they make 20-qubit problems tractable (instead of "cannot wait for the matrix to be constructed"): construction takes some hours. But it still takes hours.

  3. To do better, you have to build the final matrix row-by-row, directly from the list of Pauli strings (or their bit-vector equivalents) and coefficients. And if you do that, you can get the time down for a 20-qubit problem to negligible, and for 24-qubit problems to ... ten minutes. Something like that. Fast enough that it's inconsequential compared to (for example) extracting the lowest eigenvalue of the matrix using the PySCF Davidson solver.
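
A minimal sketch of that insertion-sort-to-mergesort idea, assuming each per-Pauli row fragment is already a sorted list of (col, value) pairs (the helper is hypothetical, not Qiskit or qrusty API):

import heapq

def merge_rows(rows):
    """k-way merge of sorted (col, value) row fragments, summing values
    that land on the same column, instead of insertion-sorting each
    fragment into the accumulated row one at a time."""
    merged = []
    for col, val in heapq.merge(*rows, key=lambda cv: cv[0]):
        if merged and merged[-1][0] == col:
            merged[-1][1] += val    # coalesce duplicate columns
        else:
            merged.append([col, val])
    return merged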

chetmurthy commented 1 year ago

"the basic idea follows Eq.4"

I'd like to continue to try to understand your proposal. I'm going to describe what I think the situation is, and hopefully I'll get it right. I'm not saying you're wrong, but rather that I don't quite understand how to do this efficiently at the sizes we're contemplating. I'm going to describe the ramifications of your observation/proposal, and if you think I've got it right, then I'll go off and code it up.

That method for constructing the matrix that corresponds to a Pauli string is what's in Qiskit, and what I've implemented in the Rust code of qrusty (directly copying Qiskit's Python code). Or at least, I believe so. Now, the matrix corresponding to a Pauli string has one nonzero value in each row, so when we multiply such a matrix by a dense vector, each row x vector dot-product becomes a vector lookup and a complex multiply, right?

But when we have 20k or 30k such Pauli strings, then (for each row) we can either

  1. iterate over the Pauli strings, computing the (col, value) nonzero from each, then doing a vector lookup, to compute the dot-product (which means random access into the vector), or
  2. iterate over the Pauli strings, compute the (col, value) nonzeros, sort by col, construct what is effectively a sparse vector, and then do a sparse-vector x dense-vector dot-product.

For the largest 24-qubit problem I'm working on, we've got 92B nonzeros. That's ~2^37, so 2^13 nonzeros per row. So those 30k Pauli strings produce some (col, value) pairs with the same column number. We can either coalesce, constructing (effectively) the sparse row of the matrix, or we can just generate the nonzeros one by one as we dot-product. If we do the former, we can march down the 2^24-entry dense vector linearly; if we do the latter, we have to random-access into the vector.

It seems like the efficient plan is to construct the row, sorting it into a sparse-vector. But at that point, it comes down to which is more efficient:

a. construct the row and save it (i.e., build the sparse matrix), reusing it for each SpMV
b. construct the row each time we perform an SpMV, use it and then discard it

Is that about correct? If so, I think this is testable.
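
In terms of the earlier (hypothetical) sketches, the testable comparison would look something like this, where spo_to_csr, row_dot, and masks come from the sketches above and x is the dense vector:

import numpy as np

# (a) construct the CSR matrix once, then reuse it for every SpMV
mat = spo_to_csr(labels, coeffs, num_qubits)
y_a = mat @ x

# (b) matrix-free: recompute each row on the fly for every SpMV
y_b = np.array([row_dot(masks, coeffs, row, x) for row in range(1 << num_qubits)])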

ETA: 2^13 nonzeros in a sparse format is 2^18 bytes (2^5 bytes each, b/c a complex value + 2 longs; maybe a little less); a 2^24-entry dense vector is 2^28 bytes. So we're talking larger than cache size regardless.

And I should add that there are people I'm working with who are doing similar things for 27 qubits (but with only 55 Pauli strings, so vastly smaller matrices; still, the dense vector is 2^27 entries, so 2^31 bytes).

nonhermitian commented 1 year ago

Yeah, I forgot that the restriction to Paulis simplifies things a bit.

It is easy to compute each row given the form of the Pauli operator. Expressing your row index as a bit-string, you can get the corresponding column value by flipping the bits that correspond to non-diagonal Paulis. Computing the element value itself is linear in the number of qubits.

So then, yeah: is it better to form a matrix representation of the whole, or to go one by one? Typically matrix-free methods are not the fastest, but rather are meant to scale. I think, after realizing I was missing the Pauli simplification, that it is probably best to do it like you are now. The CSR matrix is efficient to construct here because you can already build it row by row.

chetmurthy commented 1 year ago

Ah, well. I was hoping for a little (ahem) hunting trip thru the land of code. Oh, well. Thank you for the sightseeing though!

aeddins-ibm commented 9 months ago

Since this issue has been idle for a while, just wanted to say I recently hit the same bottleneck described in this issue, and found the linked qrusty code still gives a valuable speed-up.

I am naively doing the conversion by calling these two functions in order:

import qrusty  # https://github.com/chetmurthy/qrusty
from scipy.sparse import csr_matrix

def spo_to_sparse_qrusty(spo):
    # Rebuild the Qiskit SparsePauliOp as qrusty objects, then construct
    # the sparse matrix row-wise.
    paulis = [qrusty.Pauli(l.to_label()) for l in spo.paulis]
    coeffs = spo.coeffs
    return qrusty.SparsePauliOp(paulis, coeffs).to_matrix_mode("RowwiseUnsafeChunked/1000")

def sparse_qrusty_to_sparse_mat(spop):
    # Export the raw CSR arrays and wrap them in a SciPy csr_matrix.
    shape, data, indices, indptr = spop.export()
    return csr_matrix((data, indices, indptr), shape=shape)

While this is probably not optimal, I found that for large SPOs this was vastly faster than using Qiskit's existing SparsePauliOp.to_matrix(sparse=True). So it seems like it would still be very worthwhile to update Qiskit to use this routine, or something similar, if feasible.

jakelishman commented 9 months ago

I'll have a look at bringing some upgraded SPO-to-matrix routines in-house if it's currently being a bit of a bottleneck for people. The current version of the code isn't intended to be performant, just simple to get working. I've got to take a transatlantic flight on Saturday, which is as good an excuse as any to do a bit of Rust acceleration.

Writing a largely new quantum_info module based more around matrix-free methods would be a much larger undertaking that we'd need to set aside a lot more time for, and of course any time we expose stuff to Python it is going to suffer the same problems that heavy Numpy-based code has (and that things like tensorflow / jax / numba solve in various ways): everything has to be constructed as "vectorised" loops over simple primitive kernels, which leads to additional loop overhead and temporary allocations.

jakelishman commented 9 months ago

@aeddins-ibm: just for benchmarking purposes, do you have a rough estimate for the numbers of qubits in your operators and the number of Pauli terms?

I wrote an implementation of both dense-matrix and CSR-matrix construction methods that act directly on Qiskit's SparsePauliOp, and the performance is pretty promising (a shade faster than qrusty for the problem domains I tried, even with qrusty threaded and the new one as yet running single-threaded), but I could do with checking more closely against the problem domains you care about.

aeddins-ibm commented 9 months ago

Thanks Jake, that's very cool you can match the performance even single-threaded. It's not pressing but I'd be interested in trying your method at some point. The convenience/portability of being able to stay within standard Qiskit is definitely a plus as well.

jakelishman commented 9 months ago

@aeddins-ibm: I've just pushed up #11388, which shows the general method here. I spent a bit of time tidying it up in a software-engineering sense, to absolutely minimise the amount of unsafe code we have to play with. There are probably still ways to even out its performance (something seems a little weird with the allocations/de-allocations on my Mac at exactly 12q, but not above or below), but it's already pretty fast and I ought to go work on other stuff now.