tomlqc opened 2 weeks ago
I have performed a benchmark with the following setup:
| num_wires | default | lightning |
|---|---|---|
| 5 | 0.02415302 | 0.01594006 |
| 6 | 0.02579262 | 0.01603134 |
| 7 | 0.02741707 | 0.01615754 |
| 8 | 0.02954822 | 0.01612877 |
| 9 | 0.03091021 | 0.01626475 |
| 10 | 0.03338226 | 0.01591212 |
| 11 | 0.03487502 | 0.01638795 |
| 12 | 0.03691649 | 0.01649593 |
| 13 | 0.03912803 | 0.01675916 |
| 14 | 0.04284222 | 0.01769783 |
| 15 | 0.04721544 | 0.01946455 |
| 16 | 0.05462405 | 0.02161267 |
| 17 | 0.06258643 | 0.02764239 |
| 18 | 0.07848529 | 0.08313 |
| 19 | 0.2460156 | 0.12398331 |
| 20 | 0.21531559 | 0.24791661 |
| 21 | 0.71412075 | 0.50003818 |
| 22 | 1.45517968 | 1.02306128 |
| 23 | 2.99782932 | 2.01809743 |
| 24 | 5.91070788 | 3.93206317 |
| 25 | 11.6924843 | 7.68381341 |
Across the range of `num_wires`, the `lightning.qubit` device runs faster than `default.qubit` (I did not observe `default.qubit` being faster than `lightning.qubit`).
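For reference, a minimal timing harness of the kind that could produce the table above might look like the following. The exact benchmarked circuit is not given in this issue, so the QNode construction is only sketched in comments (hypothetical, assuming PennyLane is installed), and a trivial stand-in callable keeps the harness runnable on its own:

```python
import time


def time_fn(fn, repeats=5):
    """Return the best-of-N wall-clock time of fn() in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# In the actual benchmark, fn would be a QNode returning qml.state(), e.g.
# (hypothetical sketch):
#   dev = qml.device(device_name, wires=num_wires)  # "lightning.qubit" or "default.qubit"
#   @qml.qnode(dev)
#   def circuit():
#       return qml.state()
#   t = time_fn(circuit)

# Trivial stand-in so the harness runs as-is:
t = time_fn(lambda: sum(range(1000)))
print(f"{t:.6f} s")
```

Taking the best of several repeats (rather than the mean) reduces noise from OS scheduling, which matters at the sub-millisecond timings seen for small `num_wires`.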
I chose to use two profilers: memray and MAP.
Here I used a larger `num_wires=27` to help identify the memory allocations. We first look at the profiling from memray (`lightning.qubit`), which shows the calls with the largest memory allocations. (I repeated the circuit twice, as seen in the diagram, and focus only on the first run.)
From the call stack we can see three distinct phases of memory allocation.
1. `dev = qml.device(device_name, wires=num_wires)`, which uses pybind to call the allocation functions. This is not related to the circuit / `qml.state()`.
2. Within the circuit, for `qml.state()`, when the measurement is performed in `state_diagonalizing_gates`: at `state_array = self._qubit_state.state` in `_measurements.py`, a new numpy array is created in memory (`np.zeros` in `_state_vector.py`) before the data is copied from the C++ array into the numpy array (via `self._qubit_state.getState(state)` in `_state_vector.py`).
3. `result = measurementprocess.process_state(state_array, wires)` in `_measurements.py` calls PennyLane's `process_state` in `state.py`. At `return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j`, it returns `state + 0.0j`, which creates an extra (in theory temporary) copy of the state in numpy.

During application of the circuit, the state vector lives in the C++ memory buffer. The Python binding to these C++ state-vector manipulations is `self._qubit_state` in this class: https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/lightning_qubit/_state_vector.py#L41 ; it is used to call the methods that create/read/update the state vector in the C++ memory buffer from Python.
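The three phases can be sketched in pure numpy (the C++ buffer and `getState` are mocked here; the real ones sit behind the pybind layer):

```python
import numpy as np

NUM_WIRES = 5  # small stand-in for the num_wires=27 used in the profile

# Phase 1 (mocked): the C++ state-vector buffer allocated at device creation.
cpp_buffer = np.zeros(2**NUM_WIRES, dtype=np.complex128)
cpp_buffer[0] = 1.0  # |00...0>


def get_state(out):
    """Stand-in for self._qubit_state.getState(state): C++ -> numpy copy."""
    out[:] = cpp_buffer


# Phase 2: a fresh numpy array is allocated (np.zeros), then the C++ data
# is copied into it.
state_array = np.zeros(2**NUM_WIRES, dtype=np.complex128)
get_state(state_array)

# Phase 3: process_state returns `state + 0.0j`, a third (temporary) copy.
result = state_array + 0.0j

assert not np.shares_memory(state_array, result)  # three distinct allocations
```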
Each copy of the state vector is about 2 GB (2^27 amplitudes × 16 bytes per complex128 = 2 GB), and the 3 copies created above explain the peak usage of ~6.51 GB.
Comparing the memory footprint to the pure-Python `default.qubit` implementation: in `default.qubit` there is no extra copy in C++, and a new copy of the array is not created at `return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j`, which results in a much lower memory footprint (~3.28 GB):
Back to `lightning.qubit`: in terms of the timing cost of the memory operations, we can look at the MAP profiler result. This confirms that:

- time is spent in `self._qubit_state.getState(state)`, right after the Python vector is created with `np.zeros`
- an `array_add` comes from `return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j`

Both points might be improved.
In terms of copying the C++ state vector to Python numpy, this may not be necessary. For this circuit, since no further gates are applied before returning the state, there are no operations before the copy. If there is no need for an explicit copy in Python, we can simply expose the C++ array by creating a view in Python, without copying it as in https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/core/src/simulators/lightning_qubit/bindings/LQubitBindings.hpp#L206 . It might be beneficial to have both a copy method and a view method to improve general memory management.
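The copy-vs-view distinction can be illustrated at the numpy level, using a plain numpy array as a stand-in for the buffer exposed by the bindings:

```python
import numpy as np

buf = np.zeros(2**20, dtype=np.complex128)  # stand-in for the C++ buffer

view = buf[:]      # a view: shares the underlying memory, no new allocation
copy = buf.copy()  # a copy: a second full-size allocation

assert np.shares_memory(buf, view)
assert not np.shares_memory(buf, copy)

# Writing through the view mutates the original buffer -- this is why a
# view-returning method is only safe when no further mutation follows.
view[0] = 1.0
assert buf[0] == 1.0
```

This is also the trade-off behind offering both methods: the view is free but aliases the simulator's live buffer, while the copy is safe but costs a full allocation.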
In terms of the `array_add` from the last point, in this case it is unclear why `state + 0.0j` is returned instead of `state` (assuming the initial state already has the correct complex datatype). By returning `state` instead of `state + 0.0j`, there is no need for a new temporary copy of the state vector in numpy. This results in lower memory consumption (~4.36 GB). From quick testing this seems to produce identical results, but more rigorous testing is needed to show it is correct.
### Important Note

⚠️ This issue is part of an internal assignment and is not meant for external contributors.
### Context

The `lightning.qubit` device in PennyLane-Lightning has optimal support for many quantum gates and measurement processes at both the Python and C++ layers. The `LightningMeasurements` class at `lightning_qubit/_measurements.py` implements the Python interface for the performant C++ measurement routines in `MeasurementsLQubit`. Among PennyLane's measurement processes, `qml.state`, which returns the underlying quantum state in the computational basis, is backed by the public methods of `StateVectorLQubitManaged.hpp`.

The Python <> C++ memory management plays an important role in the performance of `qml.state`, even though returning the underlying state vector is not computationally intensive. Some preliminary results determined poor scaling of `qml.state` in `lightning.qubit` compared to `default.qubit`, the default pure-Python PennyLane device.

### Requirements
- Benchmark `lightning.qubit` vs `default.qubit`. In this code sample, `device_name` can be either `lightning.qubit` or `default.qubit`, and `5 < num_wires < 25`. Define some thresholds where `default.qubit` is faster than `lightning.qubit`.
- How could the performance of `qml.state` be improved?

Please provide your answers as follow-up comments in this GitHub issue. You may use GitHub Gist for larger files.
Feel free to ask any questions or raise any concerns regarding the issue. We'll be happy to discuss it with you!