tomlqc opened 2 weeks ago
I have performed a benchmark with the following setup:
| num_wires | default | lightning |
|---|---|---|
| 5 | 0.02415302 | 0.01594006 |
| 6 | 0.02579262 | 0.01603134 |
| 7 | 0.02741707 | 0.01615754 |
| 8 | 0.02954822 | 0.01612877 |
| 9 | 0.03091021 | 0.01626475 |
| 10 | 0.03338226 | 0.01591212 |
| 11 | 0.03487502 | 0.01638795 |
| 12 | 0.03691649 | 0.01649593 |
| 13 | 0.03912803 | 0.01675916 |
| 14 | 0.04284222 | 0.01769783 |
| 15 | 0.04721544 | 0.01946455 |
| 16 | 0.05462405 | 0.02161267 |
| 17 | 0.06258643 | 0.02764239 |
| 18 | 0.07848529 | 0.08313 |
| 19 | 0.2460156 | 0.12398331 |
| 20 | 0.21531559 | 0.24791661 |
| 21 | 0.71412075 | 0.50003818 |
| 22 | 1.45517968 | 1.02306128 |
| 23 | 2.99782932 | 2.01809743 |
| 24 | 5.91070788 | 3.93206317 |
| 25 | 11.6924843 | 7.68381341 |
Across the range of `num_wires`, the `lightning.qubit` device runs faster than `default.qubit` (I did not observe `default.qubit` being faster than `lightning.qubit`).
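For reference, a minimal timing harness of the kind that could produce the table above might look like the following. The exact benchmarked circuit is not given in this issue, so the QNode construction is only sketched in comments (hypothetical, assuming PennyLane is installed), and a trivial stand-in callable keeps the harness runnable on its own:

```python
import time


def time_fn(fn, repeats=5):
    """Return the best-of-N wall-clock time of fn() in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# In the actual benchmark, fn would be a QNode returning qml.state(), e.g.
# (hypothetical sketch):
#   dev = qml.device(device_name, wires=num_wires)  # "lightning.qubit" or "default.qubit"
#   @qml.qnode(dev)
#   def circuit():
#       return qml.state()
#   t = time_fn(circuit)

# Trivial stand-in so the harness runs as-is:
t = time_fn(lambda: sum(range(1000)))
print(f"{t:.6f} s")
```

Taking the best of several repeats (rather than the mean) reduces noise from OS scheduling, which matters at the sub-millisecond timings seen for small `num_wires`.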
I chose to use two profilers: memray and MAP.
Here I used a larger `num_wires=27` to help identify the memory allocations. We first look at the profiling from memray (`lightning.qubit`), which shows the calls with the largest memory allocations. (I repeated the circuit twice, as seen in the diagram, and focus only on the first run.)
From the call stack we can see three distinct phases of memory allocation.
1. `dev = qml.device(device_name, wires=num_wires)`, which uses pybind to call the allocation functions. This is not related to the circuit / `qml.state()`.
2. Within the circuit, for `qml.state()`, when the measurement is performed in `state_diagonalizing_gates`: at `state_array = self._qubit_state.state` in `_measurements.py`, a new numpy array is created in memory (`np.zeros` in `_state_vector.py`) before the data is copied from the C++ array into the numpy array (via `self._qubit_state.getState(state)` in `_state_vector.py`).
3. `result = measurementprocess.process_state(state_array, wires)` in `_measurements.py` calls PennyLane's `process_state` in `state.py`. At `return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j`, it returns `state + 0.0j`, which creates an extra (in theory temporary) copy of the state in numpy.

During application of the circuit, the state vector lives in the C++ memory buffer. The Python binding to these C++ state-vector manipulations is `self._qubit_state` in this class: https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/lightning_qubit/_state_vector.py#L41 ; it is used to call the methods that create/read/update the state vector in the C++ memory buffer from Python.
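The three phases can be sketched in pure numpy (the C++ buffer and `getState` are mocked here; the real ones sit behind the pybind layer):

```python
import numpy as np

NUM_WIRES = 5  # small stand-in for the num_wires=27 used in the profile

# Phase 1 (mocked): the C++ state-vector buffer allocated at device creation.
cpp_buffer = np.zeros(2**NUM_WIRES, dtype=np.complex128)
cpp_buffer[0] = 1.0  # |00...0>


def get_state(out):
    """Stand-in for self._qubit_state.getState(state): C++ -> numpy copy."""
    out[:] = cpp_buffer


# Phase 2: a fresh numpy array is allocated (np.zeros), then the C++ data
# is copied into it.
state_array = np.zeros(2**NUM_WIRES, dtype=np.complex128)
get_state(state_array)

# Phase 3: process_state returns `state + 0.0j`, a third (temporary) copy.
result = state_array + 0.0j

assert not np.shares_memory(state_array, result)  # three distinct allocations
```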
Each copy of the state vector is about 2 GB (2^27 amplitudes × 16 bytes per complex128 = 2 GB), and the 3 copies created above explain the peak usage of ~6.51 GB.
Comparing the memory footprint to the pure-Python `default.qubit` implementation: in `default.qubit` there is no extra copy in C++, and a new copy of the array is not created at `return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j`, which results in a much lower memory footprint (~3.28 GB):
Back to `lightning.qubit`: in terms of the timing cost of the memory operations, we can look at the MAP profiler result. This confirms that:

- time is spent in `self._qubit_state.getState(state)`, right after the Python vector is created with `np.zeros`
- an `array_add` comes from `return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j`

Both points might be improved.
In terms of copying the C++ state vector to Python numpy, this may not be necessary. For this circuit, since no further gates are applied before returning the state, there are no operations before the copy. If there is no need for an explicit copy in Python, we can simply expose the C++ array by creating a view in Python, without copying it as in https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/core/src/simulators/lightning_qubit/bindings/LQubitBindings.hpp#L206 . It might be beneficial to have both a copy method and a view method to improve general memory management.
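The copy-vs-view distinction can be illustrated at the numpy level, using a plain numpy array as a stand-in for the buffer exposed by the bindings:

```python
import numpy as np

buf = np.zeros(2**20, dtype=np.complex128)  # stand-in for the C++ buffer

view = buf[:]      # a view: shares the underlying memory, no new allocation
copy = buf.copy()  # a copy: a second full-size allocation

assert np.shares_memory(buf, view)
assert not np.shares_memory(buf, copy)

# Writing through the view mutates the original buffer -- this is why a
# view-returning method is only safe when no further mutation follows.
view[0] = 1.0
assert buf[0] == 1.0
```

This is also the trade-off behind offering both methods: the view is free but aliases the simulator's live buffer, while the copy is safe but costs a full allocation.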
In terms of the `array_add` from the last point, in this case it is unclear why `state + 0.0j` is returned instead of `state` (assuming the initial state already has the correct complex datatype). By returning `state` instead of `state + 0.0j`, there is no need for a new temporary copy of the state vector in numpy. This results in lower memory consumption (~4.36 GB). From quick testing this seems to produce identical results, but more rigorous testing is needed to show it is correct.
### Important Note

⚠️ This issue is part of an internal assignment and is not meant for external contributors.
### Context

The `lightning.qubit` device in PennyLane-Lightning has optimal support for many quantum gates and measurement processes at both the Python and C++ layers. The `LightningMeasurements` class at `lightning_qubit/_measurements.py` implements the Python interface for the performant C++ measurement routines in `MeasurementsLQubit`. Among PennyLane's measurement processes, `qml.state`, which returns the underlying quantum state in the computational basis, is backed by the public methods of `StateVectorLQubitManaged.hpp`.

The Python <> C++ memory management plays an important role in the performance of `qml.state`, even though returning the underlying state vector is not computationally intensive. Some preliminary results determined poor scaling of `qml.state` in `lightning.qubit` compared to `default.qubit`, the default pure-Python PennyLane device.

### Requirements
- Benchmark `lightning.qubit` vs `default.qubit`. In this code sample, `device_name` can be either `lightning.qubit` or `default.qubit`, and `5 < num_wires < 25`. Define some thresholds where `default.qubit` is faster than `lightning.qubit`.
- How could the performance of `qml.state` be improved?

Please provide your answers as follow-up comments in this GitHub issue. You may use GitHub Gist for larger files.
Feel free to ask any questions or raise any concerns regarding the issue. We'll be happy to discuss it with you!