PennyLaneAI / pennylane-lightning-gpu

GPU enabled Lightning simulator for accelerated circuit simulation. See https://github.com/PennyLaneAI/pennylane-lightning for all future development of this project.
https://docs.pennylane.ai/projects/lightning/en/stable/
Apache License 2.0

time portion for simulations is small compared to pre- and post- processing #54

Closed yitchen-tim closed 1 year ago

yitchen-tim commented 1 year ago

Issue description

There is a subroutine called apply_cq in lightning_gpu that evolves the state according to the quantum circuit. Is apply_cq the only place where the quantum state is evolved? For example, if I profile the run time of this routine, is it a good representation of the time spent on GPU computation?

In my profiling, a 28-qubit GHZ circuit took about 10 seconds, but only 0.01% of the run time was spent in apply_cq. Initialization, data movement, and post-processing took the majority of the run time. Is there a roadmap for reducing these overheads? (For example, could some post-processing be moved to the GPU so that large chunks of data do not need to be moved around?)

Source code and tracebacks

import time

import pennylane as qml

N = 28
dev = qml.device("lightning.gpu", wires=N, shots=10)

@qml.qnode(dev)
def ghz():
    # Prepare a GHZ state: Hadamard on the first wire, then a CNOT chain
    qml.Hadamard(wires=0)
    for i in range(N - 1):
        qml.CNOT(wires=[i, i + 1])
    return qml.probs(wires=range(N))

start = time.time()
result = ghz()
print(time.time() - start)
mlxd commented 1 year ago

Hi @yitchen-tim thanks for the info on this. If you'd be happy to share some nvprof data, or even a cProfile output from this we'd be happy to take a look. We are aware that setup and initialization of the CUDA device (at least for the first instantiation) can be heavy, but we would need to confirm first that the issue is on our side and not the CUDA driver and context setup as part of cuQuantum's initialization.

yitchen-tim commented 1 year ago

Setting diff_method=None does not change the run time, so hooking up to an ML library is likely not the bottleneck.

Digging deeper, below are the results (both simulators use cuQuantum). There is about 9 seconds of fixed overhead, and the scaling of both simulators is the same as the circuit depth increases.

N=28, L=1

N=28, L=10

N=28, L=100

Note: N is the qubit count and L is the number of GHZ layers. (Timing results for each case were attached as images.)

mlxd commented 1 year ago

Hi @yitchen-tim with the merge of https://github.com/PennyLaneAI/pennylane-lightning-gpu/pull/70 this issue should now be resolved. Feel free to try out the current master, or wait about 3 weeks and try out release v0.28.0.

Closing for now as this is resolved, but feel free to reopen if you see this is not the case.