NVIDIA / cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
https://docs.nvidia.com/cuda/cuquantum/
BSD 3-Clause "New" or "Revised" License

CircuitToEinsum for QFT: batched_amplitudes slower than qsim full statevector simulation #22

Closed · rht closed this issue 1 year ago

rht commented 1 year ago

I was trying to reproduce a statement in https://developer.nvidia.com/blog/nvidia-announces-cuquantum-beta-availability-record-quantum-benchmark-and-quantum-container/, in particular:

Quantum Fourier Transform – accelerated from 29 mins down to 19 secs

I'm running the following minimal reproducer on Python 3.8 with cuQuantum 22.11 on an NVIDIA A100 40 GB (GCP instance):

import time

import cirq
import qsimcirq
import cupy
from cuquantum import contract
from cuquantum import CircuitToEinsum

simulator = qsimcirq.QSimSimulator()

# See https://quantumai.google/cirq/experiments/textbook_algorithms
def make_qft(qubits):
    """Generator for the QFT on a list of qubits."""
    qreg = list(qubits)
    while len(qreg) > 0:
        q_head = qreg.pop(0)
        yield cirq.H(q_head)
        for i, qubit in enumerate(qreg):
            yield (cirq.CZ ** (1 / 2 ** (i + 1)))(qubit, q_head)

def simulate_and_measure(nqubits):
    qubits = cirq.LineQubit.range(nqubits)
    qft = cirq.Circuit(make_qft(qubits))

    myconverter = CircuitToEinsum(qft, backend=cupy)

    tic = time.time()
    simulator.simulate(qft)
    elapsed_qsim = time.time() - tic
    out = {"qsim": elapsed_qsim}

    # CUDA expectation
    pauli_string = {qubits[0]: 'Z'}
    expression, operands = myconverter.expectation(pauli_string, lightcone=True)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_expectation"] = elapsed

    # CUDA batched amplitudes
    # Fix every qubit except the last one to |0>
    fixed_states = "0" * (nqubits - 1)
    num_fixed = len(fixed_states)
    fixed = dict(zip(myconverter.qubits[:num_fixed], fixed_states))
    expression, operands = myconverter.batched_amplitudes(fixed)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_batched"] = elapsed

    return out

# 35 and 40 qubits were run separately, with the qsim call skipped (CUDA OOM).
for i in [10, 15, 20, 25, 30]:
    print(i, simulate_and_measure(i))

Output (values are elapsed times in seconds; the leading 10, 15, ... are the number of qubits in the QFT):

10 {'qsim': 0.9677999019622803, 'cu_expectation': 0.29337143898010254, 'cu_batched': 0.07590365409851074}
15 {'qsim': 0.023270368576049805, 'cu_expectation': 0.019628524780273438, 'cu_batched': 0.3687710762023926}
20 {'qsim': 0.03504538536071777, 'cu_expectation': 0.023822784423828125, 'cu_batched': 0.9347813129425049}
25 {'qsim': 0.14235782623291016, 'cu_expectation': 0.02486586570739746, 'cu_batched': 2.39030122756958}
30 {'qsim': 3.4044816493988037, 'cu_expectation': 0.028923749923706055, 'cu_batched': 4.6819908618927}
35 {'cu_expectation': 1.0615959167480469, 'cu_batched': 10.964831829071045}
40 {'cu_expectation': 0.03381609916687012, 'cu_batched': 82.43729209899902}

I wasn't able to go to 35 qubits with qsim because it ran out of GPU memory (CUDA OOM). The much-reduced memory usage alone is enough to prefer cuQuantum for this use case.

But I was hoping that batched_amplitudes would be faster than a full statevector simulation, since some of the qubits are fixed, and that doesn't seem to be the case. I have also tried reduced_density_matrix (not shown, to keep the snippet short). The only one that is consistently fast is expectation. Am I doing something wrong?

yangcal commented 1 year ago

Hello, thanks for reaching out.

First, regarding the performance statement in our tech blog for the cuQuantum beta: the benchmark data shown in the blog was measured with cuStateVec. Both cuQuantum libraries (cuStateVec and cuTensorNet) have improved significantly since the blog was written. Also, compared with state-vector simulation, tensor networks generally trade computational cost for memory savings, so some performance overhead is expected.

Going back to your benchmark, there are a couple of things to note.

First, when measuring the performance of GPU functions, you should not use CPU timing utilities: the GPU work may not have finished when time.time() is called. Use either cupyx.profiler.benchmark() or CUDA events (cupy.cuda.Event()) together with cupy.cuda.get_elapsed_time().
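For example, a minimal sketch of both approaches, assuming expression and operands are the values returned by CircuitToEinsum as in your snippet:

import cupy as cp
from cupyx.profiler import benchmark
from cuquantum import contract

# expression, operands: as returned by CircuitToEinsum (see reproducer above)

# Option 1: benchmark() warms up, repeats, and synchronizes the GPU,
# reporting both CPU and GPU times.
print(benchmark(contract, (expression, *operands), n_repeat=10))

# Option 2: bracket the call with CUDA events and measure on the GPU side.
start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
contract(expression, *operands)
end.record()
end.synchronize()
print(cp.cuda.get_elapsed_time(start, end), "ms")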

Second, a call to cuquantum.contract contains two steps: path finding and contraction execution. You can time the two steps separately with the following modification:


from cuquantum import contract, contract_path

path, info = contract_path(expression, *operands)
contract(expression, *operands, optimize={"path": path})

You'll see that in your test cases, more time is spent on the path-finding step than on the actual execution, and the execution time can often be lower than that of state vector simulation (as in qsimcirq or cuStateVec). As the problem size increases further, the execution time will rapidly scale up and dominate. A good general strategy is to cache and reuse these paths if you're contracting the same tensor network multiple times; see the sketch below, and check out our Python sample and notebook sample (inside the get_expectation_cutn function).
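A minimal sketch of that caching strategy, where all_operands is a hypothetical list of operand sets sharing the same expression and tensor shapes (e.g. the same circuit structure with new tensor values):

from cuquantum import contract, contract_path

# Find the path once for the first operand set...
path, info = contract_path(expression, *all_operands[0])

# ...then reuse the cached path for every contraction over the same network.
results = [contract(expression, *ops, optimize={"path": path})
           for ops in all_operands]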

Third, you're turning the right knob for the expectation-value computation. For the QFT circuit, when you turn on the lightcone option, the network size is greatly reduced if you measure only one or a few qubits. You can easily see this by comparing the number of operands for your expectation and batched_amplitudes computations, as in the check below. Since the network for the expectation value is much smaller than the one for the amplitudes, both its path-finding and execution times are lower than those of the amplitude computation.
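For instance, reusing the names from the reproducer above:

# Compare network sizes: the lightcone-reduced expectation network
# contains far fewer tensors than the batched-amplitudes network.
expr_e, ops_e = myconverter.expectation({qubits[0]: 'Z'}, lightcone=True)
expr_b, ops_b = myconverter.batched_amplitudes(fixed)
print(len(ops_e), len(ops_b))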

Finally, another small tip for your snippet: reuse the cuTensorNet handle to reduce the overhead for small problems, e.g.:


import cuquantum.cutensornet as cutn
from cuquantum import contract, contract_path

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    …
    path, info = contract_path(expression, *operands, options={"handle": handle})
    contract(expression, *operands, optimize={"path": path}, options={"handle": handle})
    …

cutn.destroy(handle)

Also, if you're using cuQuantum Python 22.11 or above, enabling the nonblocking behavior further helps overlap CPU and GPU activities and reduce the run time, especially when both path finding and contraction execution take a long time:

import cuquantum.cutensornet as cutn
from cuquantum import contract, contract_path

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    …
    path, info = contract_path(expression, *operands,
                               options={"handle": handle, "blocking": "auto"})
    contract(expression, *operands, optimize={"path": path},
             options={"handle": handle, "blocking": "auto"})
    …

cutn.destroy(handle)

Also refer to our Network-object based samples if you expect to reuse the same network topology multiple times; they help further reduce the overhead of Python object management. A sketch of that workflow follows.
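A minimal sketch of the Network-object workflow, assuming expression and operands as produced by CircuitToEinsum above (new_operands is hypothetical):

from cuquantum import Network

with Network(expression, *operands) as tn:
    # Path finding and contraction are explicit, separate steps here.
    path, info = tn.contract_path()
    result = tn.contract()

    # For another network with the same topology, swap in the new tensors
    # and contract again without re-running the path finder:
    # tn.reset_operands(*new_operands)
    # result2 = tn.contract()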

leofang commented 1 year ago

Let me convert this to a GitHub Discussion thread and we can continue there. Thanks.