PennyLaneAI / pennylane-lightning-gpu

GPU enabled Lightning simulator for accelerated circuit simulation. See https://github.com/PennyLaneAI/pennylane-lightning for all future development of this project.
https://docs.pennylane.ai/projects/lightning/en/stable/
Apache License 2.0

Memory allocation issues with Jupyter #130


Kenny-Heitritter commented 1 year ago

Issue description


When using the PennyLane lightning.gpu plugin on qBraid Lab, which runs JupyterLab notebooks, an unexpectedly large amount of RAM is used, resulting in an out-of-memory error for relatively small (14-qubit) simulations. The system being used has 4 cores, 13.5 GB of RAM, and an NVIDIA T4 (16 GB VRAM). The issue does not occur when using cuQuantum directly.

[Screenshot of the out-of-memory failure in JupyterLab]

Platform info: Linux-5.4.247-162.350.amzn2.x86_64-x86_64-with-glibc2.31
Python version: 3.9.12
Numpy version: 1.23.5
Scipy version: 1.10.0
Installed devices:

Source code and tracebacks

Notebook used on qBraid Lab with GPU instance is attached.
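For reference, a minimal sketch of the kind of workload that triggers the failure (the exact circuit is in the attached notebook; the wire count, gates, and observable here are illustrative only):

import pennylane as qml
from pennylane import numpy as np

n_wires = 14  # illustrative; see the attached notebook for the real circuit
dev = qml.device("lightning.gpu", wires=n_wires)

@qml.qnode(dev)
def circuit(params):
    for i in range(n_wires):
        qml.RX(params[i], wires=i)
    for i in range(n_wires):
        qml.CNOT(wires=[i, (i + 1) % n_wires])
    # Expectation value of a product (Tensor-type) observable
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(n_wires - 1))

params = np.random.random(n_wires)
circuit(params)  # the Python process is killed rather than raising an exception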

Additional information

pennylane_gpu_bug.zip

mlxd commented 1 year ago

Hey @Kenny-Heitritter thanks for the report.

Having a look at the above code and trying it out on a 24GB card, I don't think the issue is with memory allocations, but rather with an access issue through the custatevec API. Everything in PennyLane is wrapped with safety checks -- we ensure that any time a problem is hit in a lower layer, an exception is thrown up to the running Python process, which makes it easy to spot any issue that crops up. In this case, tracking CUDA memory allocations isn't something natively supported by tracemalloc or similar tools, and without jumping directly into CUPTI it can be hard to identify who allocates what and where, since the CUDA driver and runtime tend to be separate and allocate independently. For this issue, the Python process appears to be killed outright; a memory allocation error would raise an exception instead, which leads me to think the problem is a C-API call made into custatevec from our bindings.

Long story short, it seems the issue here is with the Tensor type and the bound call from the C++ binding code onwards to custatevec. If we convert the Tensor type to a Hamiltonian with a coefficient of 1.0, everything passes just fine. You can try this with:

import pennylane as qml
from pennylane import numpy as np
from functools import reduce

n_wires = 15
dev_gpu = qml.device("lightning.gpu", wires=n_wires)

# Identity on every wire except the last, which carries a PauliZ
observables = [
    qml.Identity(i) for i in range(n_wires - 1)
] + [qml.PauliZ(n_wires - 1)]

def t_prod(op1, op2):
    return op1 @ op2

def createH(ops):
    # The 1.0 coefficient promotes the Tensor product to a Hamiltonian
    return 1.0 * reduce(t_prod, ops)

@qml.qnode(dev_gpu)
def circuit(params):
    for i in range(n_wires):
        qml.RX(params[i], wires=i)
    for i in range(n_wires):
        qml.CNOT(wires=[i, (i + 1) % n_wires])

    # Expectation value of the product observable over all wires
    return qml.expval(createH(observables))

params = np.random.random(n_wires)
circuit(params)

If you remove the 1.0, the observable reverts to a Tensor type and the issue returns. I will aim to identify where this issue crops up and get a fix in soon.
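For contrast, the failing variant differs only in how the observable is built (a minimal sketch; the helper name createT is just for illustration):

from functools import reduce
import pennylane as qml

def t_prod(op1, op2):
    return op1 @ op2

def createT(ops):
    # No coefficient: reduce returns a plain Tensor observable rather than a
    # Hamiltonian, and this is the type that triggers the killed process
    return reduce(t_prod, ops)

# Swapping createH for createT in the circuit above reproduces the failure.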

Also, regarding memory usage: since lightning.gpu is focused on PennyLane workloads, we do tend to request more GPU memory than a raw call to the custatevec APIs would. This is with good reason. For example, we define a series of gate buffers that already live on the device, allowing us to avoid the custatevec host-device transfers for gates not natively offered by that interface. We observed this gives much better performance, so it will likely show up when querying GPU memory usage. PL also assumes double precision throughout, and tends to focus workloads around this, from Python all the way through to the C++ bindings; FP32 is supported, but needs additional configuration options on the device to ensure correct alignment of memory transfers. Since we also expect lightning.gpu to be used for gradient-based workloads, the adjoint differentiation pipeline in PennyLane requires additional GPU buffers to ensure enough memory is available for execution --- this can reduce the overall qubit count by 2 for a single observable, and the reduction grows logarithmically with the number of additional observables.
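As a rough sketch of the precision point above (assuming the device's c_dtype option is used to select single precision, with np.complex128 as the default):

import numpy as np
import pennylane as qml

# Sketch: request single precision (FP32) state-vector storage via c_dtype,
# roughly halving the GPU memory footprint compared to the FP64 default.
dev_fp32 = qml.device("lightning.gpu", wires=20, c_dtype=np.complex64)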

If you need to use this pipeline, you can restrict the number of concurrent statevector copies by following the notes under "Parallel adjoint differentiation support" on the docs page at https://docs.pennylane.ai/projects/lightning-gpu/en/latest/devices.html
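A rough sketch of what that looks like, assuming the batch_obs option described on that docs page (the circuit below is illustrative only):

import pennylane as qml
from pennylane import numpy as np

# Sketch: batching observables in the adjoint pipeline limits how many
# state-vector copies are resident on the GPU at once.
dev_adj = qml.device("lightning.gpu", wires=20, batch_obs=True)

@qml.qnode(dev_adj, diff_method="adjoint")
def cost(params):
    qml.RX(params[0], wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

grad = qml.grad(cost)(np.random.random(2))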

mlxd commented 1 year ago

As a trial, I ran the above with a 30 qubit payload on a 24GB device and still never hit the memory wall. Running with 31 qubits gives:

Traceback (most recent call last):
  File "/home/ubuntu/PL_GPU_Issue130/pennylane-lightning-gpu/rundir/./br2.py", line 6, in <module>
    dev_gpu = qml.device("lightning.gpu", wires=n_wires)
  File "/home/ubuntu/PL_GPU_Issue130/pennylane/pennylane/__init__.py", line 337, in device
    dev = plugin_device_class(*args, **options)
  File "/home/ubuntu/PL_GPU_Issue130/pennylane-lightning-gpu/pennylane_lightning_gpu/lightning_gpu.py", line 258, in __init__
    self._gpu_state = _gpu_dtype(c_dtype)(self._num_local_wires)
pennylane_lightning_gpu.lightning_gpu_qubit_ops.PLException: [/home/ubuntu/PL_GPU_Issue130/pennylane-lightning-gpu/pennylane_lightning_gpu/src/util/DataBuffer.hpp][Line:37][Method:DataBuffer]: Error in PennyLane Lightning: out of memory

which is how a memory allocation error would be reported.

mlxd commented 1 year ago

Did you receive a similar exception as above, or was it a killed process?

Kenny-Heitritter commented 1 year ago

Hi @mlxd and thanks for the speedy troubleshooting!

Your provided alternative code does appear to function as expected (it successfully ran up to ~28 qubits) and throws an exception when trying to over-allocate VRAM. Removing the coefficient of 1.0 on createH also results in the previously described issue where the Python process is killed. So it seems the issue is with the Tensor type, as you mentioned.

mlxd commented 1 year ago

Great --- at least we know where the issue appears. I'll take a look around in the coming days and get a fix in for this. Thanks again for the report --- it's always great when example code is provided!