PennyLaneAI / pennylane-lightning

The PennyLane-Lightning plugin provides a fast state-vector simulator written in C++ for use with PennyLane
https://docs.pennylane.ai/projects/lightning
Apache License 2.0

Memory profiling of `qml.state()` #771

Open tomlqc opened 2 weeks ago

tomlqc commented 2 weeks ago

Important Note

⚠️ This issue is part of an internal assignment and not meant for external contributors.

Context

The lightning.qubit device in PennyLane-Lightning provides optimized support for many quantum gates and measurement processes at both the Python and C++ layers. The LightningMeasurements class in lightning_qubit/_measurements.py implements the Python interface to the performant C++ measurement routines in MeasurementsLQubit. Among PennyLane's measurement processes, qml.state, which returns the underlying quantum state in the computational basis, is backed by the public methods of StateVectorLQubitManaged.hpp.

Memory management across the Python <> C++ boundary plays an important role in the performance of qml.state, even though returning the underlying state-vector is not computationally intensive. Some preliminary results showed poor scaling of qml.state in lightning.qubit compared to default.qubit, PennyLane's default pure-Python device.

Requirements

    import pennylane as qml

    device_name = "lightning.qubit"  # or "default.qubit" for comparison
    num_wires = 20
    dev = qml.device(device_name, wires=num_wires)

    @qml.qnode(dev)
    def circuit():
        return qml.state()
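
For reference, executing this circuit returns the full state-vector as a complex NumPy array of 2^num_wires amplitudes; a minimal usage sketch (num_wires and device_name as defined above):

    # Execute the QNode: with no gates applied, the state stays |0...0>.
    state = circuit()
    print(state.shape)   # (2**num_wires,)
    print(state.dtype)   # complex128 by default
    print(state[0])      # (1+0j); all other amplitudes are 0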

Please provide your answers as follow-up comments in this GitHub issue. You may use GitHub Gist for larger files.

Feel free to ask any questions or raise any concerns regarding the issue. We'll be happy to discuss them with you!

josephleekl commented 1 week ago

Benchmarking

I have performed a benchmark with the following setup, timing the qml.state() circuit for both devices across num_wires:
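
The exact script is not reproduced here; the sketch below shows one way such timings could be collected, using time.perf_counter and the circuit from the issue description:

    import time

    import pennylane as qml

    def time_state_circuit(device_name, num_wires, repeats=3):
        """Average wall-clock time to execute a qml.state() circuit."""
        dev = qml.device(device_name, wires=num_wires)

        @qml.qnode(dev)
        def circuit():
            return qml.state()

        start = time.perf_counter()
        for _ in range(repeats):
            circuit()
        return (time.perf_counter() - start) / repeats

    for n in range(5, 26):
        print(n, time_state_circuit("default.qubit", n), time_state_circuit("lightning.qubit", n))

The measured runtimes are: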

| num_wires | default.qubit | lightning.qubit |
|---|---|---|
| 5 | 0.02415302 | 0.01594006 |
| 6 | 0.02579262 | 0.01603134 |
| 7 | 0.02741707 | 0.01615754 |
| 8 | 0.02954822 | 0.01612877 |
| 9 | 0.03091021 | 0.01626475 |
| 10 | 0.03338226 | 0.01591212 |
| 11 | 0.03487502 | 0.01638795 |
| 12 | 0.03691649 | 0.01649593 |
| 13 | 0.03912803 | 0.01675916 |
| 14 | 0.04284222 | 0.01769783 |
| 15 | 0.04721544 | 0.01946455 |
| 16 | 0.05462405 | 0.02161267 |
| 17 | 0.06258643 | 0.02764239 |
| 18 | 0.07848529 | 0.08313 |
| 19 | 0.2460156 | 0.12398331 |
| 20 | 0.21531559 | 0.24791661 |
| 21 | 0.71412075 | 0.50003818 |
| 22 | 1.45517968 | 1.02306128 |
| 23 | 2.99782932 | 2.01809743 |
| 24 | 5.91070788 | 3.93206317 |
| 25 | 11.6924843 | 7.68381341 |

[Figure: runtime of default.qubit vs lightning.qubit as a function of num_wires]

Across the range of num_wires, the lightning.qubit device generally runs faster than default.qubit (I did not seem to observe default.qubit being faster than lightning.qubit).

Profiling

I chose to use two profilers: memray (memory allocation) and MAP (runtime cost).

Here I used a larger num_wires=27 to help identify the memory allocations. We first look at the profiling from memray (lightning.qubit), which shows the calls with the largest memory allocations. (I repeated the circuit twice, as seen in the diagram, and focus only on the first run.)

[Screenshot: memray profile of lightning.qubit, 2024-06-23]
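
For reproducibility, memray can be attached either from its command line or via its Python Tracker API; a minimal sketch of the latter (the output filename is arbitrary, and this is not necessarily how the profile above was produced):

    import memray
    import pennylane as qml

    num_wires = 27

    # Record the allocations made while building the device (initial C++
    # state-vector) and executing the circuit; the output can be rendered
    # with `memray flamegraph qml_state.bin`.
    with memray.Tracker("qml_state.bin"):
        dev = qml.device("lightning.qubit", wires=num_wires)

        @qml.qnode(dev)
        def circuit():
            return qml.state()

        circuit()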

Memory allocation

From the call stack we can see three distinct phases of memory allocation.

  1. The initial state-vector is allocated in C++ memory during dev = qml.device(device_name, wires=num_wires), which uses pybind11 to call the allocation functions. This is not related to the circuit/qml.state().
  2. Within the circuit qml.state(), when the measurement is performed in state_diagonalizing_gates:

    1. At state_array = self._qubit_state.state in _measurements.py, a new NumPy array is created in memory (np.zeros in _state_vector.py) before the data is copied from the C++ array into the NumPy array (via self._qubit_state.getState(state) in _state_vector.py).
    2. At result = measurementprocess.process_state(state_array, wires) in _measurements.py, PennyLane's process_state in state.py is called. At return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j, it returns state + 0.0j, which creates an extra (in theory temporary) copy of the state in NumPy (see the sketch below this section).

    During the application, the state-vector in the C++ memory buffer is:

The Python bindings to these C++ state-vector manipulations are accessed through self._qubit_state in this class: https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/lightning_qubit/_state_vector.py#L41; it is used to call the methods that create/read/update the state-vector in the C++ memory buffer from Python.

Each copy of the state vector is about 2 GB (2^27 amplitudes * 16 bytes per complex128 value = 2 GiB), and the 3 copies created above explain the peak usage of ~6.51 GB.
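
To make the three copies concrete, here is a schematic NumPy sketch of the data flow described above (the C++ copy is represented only by a comment; running this at num_wires=27 allocates roughly 4 GiB):

    import numpy as np

    num_wires = 27
    bytes_per_amplitude = np.dtype(np.complex128).itemsize   # 16 bytes
    print((2**num_wires) * bytes_per_amplitude / 2**30)      # ~2.0 GiB per copy

    # Copy 1: the C++ state-vector owned by StateVectorLQubitManaged,
    #         allocated when qml.device("lightning.qubit", ...) is created.
    # Copy 2: the NumPy buffer allocated in _state_vector.py ...
    state = np.zeros(2**num_wires, dtype=np.complex128)
    # ... and filled by the pybind11 call self._qubit_state.getState(state),
    # which copies the C++ buffer into it.
    # Copy 3: process_state() in PennyLane's state.py returns `state + 0.0j`,
    # which materializes a third, temporary array of the same size.
    result = state + 0.0j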

Comparing with the pure-Python default.qubit implementation: default.qubit has no extra copy in C++, and a new copy of the array is not created at return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j, which results in a much lower memory footprint (~3.28 GB):

[Screenshot: memray profile of default.qubit, 2024-06-23]

Runtime cost

Back to lightning.qubit: for the timing cost of the memory operations, we can look at the MAP profiler result:

[Screenshot: MAP profile of lightning.qubit, 2024-06-23]

This confirms that:

Bottlenecks

The latter two points above might be improved.

In terms of copying the C++ state-vector to a Python NumPy array, this may not be necessary. For this circuit, since no further gates are applied before returning the state, there are no operations before the copy. If no explicit copy is needed in Python, we could simply expose the C++ array as a view in Python, without copying it, similar to https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/core/src/simulators/lightning_qubit/bindings/LQubitBindings.hpp#L206. It might be beneficial to have both a copy and a view method to improve general memory management.
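
To illustrate the copy-versus-view distinction at the NumPy level, here is a stand-in sketch (a plain NumPy array plays the role of the C++ buffer; in the real bindings the buffer is owned by the C++ state-vector and reached through pybind11):

    import numpy as np

    # Stand-in for the buffer owned by the C++ state-vector (StateVectorLQubitManaged).
    cpp_buffer = np.zeros(2**20, dtype=np.complex128)

    # Current approach: allocate a fresh NumPy array and copy the data into it
    # (this is what np.zeros + getState amount to), doubling the footprint.
    state_copy = np.empty_like(cpp_buffer)
    np.copyto(state_copy, cpp_buffer)

    # View approach: wrap the same memory without copying; with pybind11 this
    # corresponds to returning a py::array_t that references the existing buffer.
    state_view = cpp_buffer.view()
    assert not state_view.flags["OWNDATA"]  # shares memory with cpp_buffer
    assert state_view.base is cpp_buffer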

In terms of the array_add from the last point, it is unclear why state + 0.0j is returned instead of state (assuming the state already has the correct complex datatype).

Possible improvement

By returning state instead of state + 0.0j, there is no need for a new temporary copy of the state-vector in NumPy. This results in lower memory consumption (~4.36 GB). From quick testing this seems to produce identical results, but further, more rigorous testing is needed to confirm it is correct (see the sketch below).
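
A rough sketch of the proposed change (the helper name return_state below is only a stand-in for the return expression inside PennyLane's process_state; it assumes the incoming array is already complex128 in the common case):

    import numpy as np
    import pennylane as qml

    def return_state(state, is_tf_interface=False):
        # Current behaviour: `state + 0.0j` builds a temporary complex copy
        # even when `state` is already complex128.
        # Sketch of the proposal: return the array as-is when it is already
        # complex, and only cast (which copies) when it is not.
        if is_tf_interface or not np.iscomplexobj(state):
            return qml.math.cast(state, "complex128")
        return state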

[Screenshot: memray profile after returning state directly, 2024-06-23]