PennyLaneAI / pennylane-lightning-gpu

GPU enabled Lightning simulator for accelerated circuit simulation. See https://github.com/PennyLaneAI/pennylane-lightning for all future development of this project.
https://docs.pennylane.ai/projects/lightning/en/stable/
Apache License 2.0

time portion for simulations is small compared to pre- and post- processing #54

Closed yitchen-tim closed 1 year ago

yitchen-tim commented 1 year ago

Issue description

There is a subroutine called apply_cq in lightning_gpu that evolves the state according to the quantum circuit. Is apply_cq the only place where the quantum state is evolved? For example, if I profile the run time of this routine, is it a good representation of the time spent on GPU computation?

In my profiling, a 28-qubit GHZ circuit took about 10 seconds, but only 0.01% of the run time was spent in apply_cq. Initialization, data movement, and post-processing took the majority of the run time. Is there a roadmap for reducing these overheads? (For example, could some post-processing be moved to the GPU so that large chunks of data do not need to be moved around?)

Source code and tracebacks

import time

import pennylane as qml

N = 28
dev = qml.device("lightning.gpu", wires=N, shots=10)

@qml.qnode(dev)
def ghz():
    # Prepare a GHZ state: Hadamard on the first wire, then a CNOT chain
    qml.Hadamard(wires=0)
    for i in range(N - 1):
        qml.CNOT(wires=[i, i + 1])
    return qml.probs(wires=range(N))

start = time.time()
result = ghz()
print(time.time() - start)
mlxd commented 1 year ago

Hi @yitchen-tim thanks for the info on this. If you'd be happy to share some nvprof data, or even a cProfile output from this we'd be happy to take a look. We are aware that setup and initialization of the CUDA device (at least for the first instantiation) can be heavy, but we would need to confirm first that the issue is on our side and not the CUDA driver and context setup as part of cuQuantum's initialization.

yitchen-tim commented 1 year ago

Setting diff_method=None does not change the run time, so hooking up to an ML library is likely not the bottleneck.

Digging deeper, below are the results (both simulators use cuQuantum). There is about 9 seconds of fixed overhead, and the scaling of both simulators is the same as the circuit depth increases.

N=28, L=1

N=28, L=10

N=28, L=100

Note: N is the qubit count and L is the number of GHZ layers. (Timing results for each case were attached as images.)

mlxd commented 1 year ago

Hi @yitchen-tim with the merge of https://github.com/PennyLaneAI/pennylane-lightning-gpu/pull/70 this issue should now be resolved. Feel free to try out the current master, or wait about 3 weeks and try out release v0.28.0.

Closing for now as this is resolved, but feel free to reopen if you see this is not the case.