Hello, thanks for reaching out.
First, regarding the performance statement in our tech blog for the cuQuantum beta: the benchmark data shown in the blog was measured with cuStateVec. Both cuQuantum libraries (cuStateVec/cuTensorNet) have improved significantly since the blog was written. Generally, compared with state vector simulation, tensor networks trade computational cost for memory savings, so some performance overhead is expected.
Going back to your benchmark, there are a couple of things to note.
First, when measuring the performance of GPU functions, you should not use CPU timing utilities. You can use either cupyx.profiler.benchmark(), or cupy.cuda.get_elapsed_time() together with CUDA events (cupy.cuda.Event()).
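For example, here is a minimal sketch of both approaches (the small contraction below is just a hypothetical stand-in for your actual workload):

```python
import cupy as cp
from cupyx.profiler import benchmark
from cuquantum import contract

# Hypothetical small network standing in for your real workload
a, b = cp.random.rand(64, 64), cp.random.rand(64, 64)

# Option 1: cupyx handles warm-up, repetition, and stream synchronization
print(benchmark(contract, ("ij,jk->ik", a, b), n_repeat=10))

# Option 2: bracket the GPU work with CUDA events
start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
contract("ij,jk->ik", a, b)
end.record()
end.synchronize()
print(cp.cuda.get_elapsed_time(start, end), "ms")
```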
Second, a call to cuquantum.contract involves two steps: path-finding and contraction execution. You can time the two steps separately with the following modification:
```python
from cuquantum import contract, contract_path

path, info = contract_path(expression, *operands)               # path-finding
contract(expression, *operands, optimize={"path": path})        # contraction execution
```
and you’ll see that in your test cases, more time is spent on the path-finding step than on the actual execution, and the execution time can often be lower than that of state vector simulation (as in qsimcirq or cuStateVec). As the problem size increases further, the execution time will rapidly scale up and dominate. A good general strategy is to cache and reuse these paths if you’re contracting the same tensor network multiple times, as in the sketch below. Check out our Python sample and notebook sample (inside the get_expectation_cutn function).
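A rough sketch of that caching pattern, with a made-up expression and operands standing in for the ones generated from your circuit:

```python
import cupy as cp
from cuquantum import contract, contract_path

# Made-up network; in your case expression/operands come from CircuitToEinsum
expression = "ab,bc,cd->ad"
operands = [cp.random.rand(8, 8) for _ in range(3)]

# Pay the path-finding cost once...
path, info = contract_path(expression, *operands)

# ...then reuse the cached path for every subsequent contraction
for _ in range(100):
    result = contract(expression, *operands, optimize={"path": path})
```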
Third, you’re tuning the right knob for expectation value computation. For the QFT circuit, when you turn on the lightcone option, the network size is greatly reduced when you measure only one or a few qubits. You can easily see this by comparing the number of operands for your expectation and batched_amplitudes computations. Since the network for the expectation value is much smaller than the one for amplitudes, both the path-finding and execution times are lower than for the amplitude computation.
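For instance, assuming a cirq-built QFT circuit (a sketch, not taken from your reproducer), you could inspect the operand counts along these lines:

```python
import cirq
import cupy as cp
from cuquantum import CircuitToEinsum

# Assumed setup: a 20-qubit QFT circuit built with cirq
qubits = cirq.LineQubit.range(20)
circuit = cirq.Circuit(cirq.qft(*qubits))
converter = CircuitToEinsum(circuit, backend=cp)

pauli_string = "Z" + "I" * 19          # observable acting on a single qubit
_, ops_full = converter.expectation(pauli_string, lightcone=False)
_, ops_lc = converter.expectation(pauli_string, lightcone=True)
print(len(ops_full), len(ops_lc))      # the lightcone-simplified network has far fewer operands
```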
Finally, another small tip for your snippet: reuse the cuTensorNet handle to reduce overhead for small problems, e.g.:
```python
import cuquantum.cutensornet as cutn

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    …
    path, info = contract_path(expression, *operands, options={"handle": handle})
    contract(expression, *operands, optimize={"path": path}, options={"handle": handle})
    …

cutn.destroy(handle)
```
If you’re using cuQuantum Python 22.11 or above, enabling nonblocking behavior can further help overlap CPU/GPU activities and reduce the run time, especially when both path-finding and contraction execution take a long time:
```python
import cuquantum.cutensornet as cutn

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    …
    path, info = contract_path(expression, *operands, options={"handle": handle, "blocking": "auto"})
    contract(expression, *operands, optimize={"path": path}, options={"handle": handle, "blocking": "auto"})
    …

cutn.destroy(handle)
```
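Put together, a self-contained sketch of this pattern might look like the following (the small network here is hypothetical, standing in for your circuit-derived expression and operands):

```python
import cupy as cp
import cuquantum.cutensornet as cutn
from cuquantum import contract, contract_path

handle = cutn.create()
opts = {"handle": handle, "blocking": "auto"}   # reuse the handle, nonblocking mode

# Hypothetical network in place of the circuit-derived one
expression = "ab,bc,cd->ad"
operands = [cp.random.rand(8, 8) for _ in range(3)]

path, info = contract_path(expression, *operands, options=opts)
result = contract(expression, *operands, optimize={"path": path}, options=opts)

cutn.destroy(handle)
```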
Also refer to our "Network"-object based samples if you expect to reuse the same network topology multiple times; this further reduces the overhead of Python object management.
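A minimal sketch of that Network-object pattern, assuming only the operand values change between contractions:

```python
import cupy as cp
from cuquantum import Network

# Hypothetical network; reuse one Network object across repeated contractions
operands = [cp.random.rand(8, 8) for _ in range(3)]

with Network("ab,bc,cd->ad", *operands) as tn:
    tn.contract_path()                                        # find the path once
    result1 = tn.contract()                                   # execute
    tn.reset_operands(*[cp.random.rand(8, 8) for _ in range(3)])
    result2 = tn.contract()                                   # same path, new operand values
```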
Let me convert this to a GitHub Discussion thread and we can continue there. Thanks.
I was trying to reproduce the statement in https://developer.nvidia.com/blog/nvidia-announces-cuquantum-beta-availability-record-quantum-benchmark-and-quantum-container/, in particular
I'm running the following minimal reproducer on Python 3.8, cuQuantum 22.11, and an NVIDIA A100 40 GB (on a GCP instance):
Output (the numbers are elapsed times in seconds; 10, 15, ... are the numbers of qubits for the QFT):
I wasn't able to go to 35 qubits with qsim because it ran out of GPU memory (CUDA OOM). The much reduced memory usage alone is sufficient to prefer cuQuantum for this use case.
But I was hoping that batched_amplitudes would be faster than a full state vector simulation, because some qubits are fixed. That doesn't seem to be the case. I have also tried reduced_density_matrix (not shown, to keep the code snippet short). The only one that is consistently fast is expectation. I wonder if I did something wrong?