NVIDIA cuQuantum support for QAOA using "lightning.gpu"

leolettuce commented 2 years ago

Expected behavior

We are simulating QAOA on an NVIDIA DGX system. Since the new PennyLane version (v0.22) supports cuQuantum through the "lightning.gpu" device, we want to use it for potential speedups. (https://pennylane.ai/blog/2022/03/pennylane-v022-released/#accelerate-your-simulations-with-cuquantum-gpu-support)

Actual behavior

I installed the device and all necessary libraries. However, after simply replacing "default.qubit" or "qulacs.simulator" with "lightning.gpu", no optimization happens: the cost_function stays constant, close to zero.

Since I have read that "lightning.gpu" works best with diff_method set to "adjoint", I also tried that. However, I then got the following error message: "The MultiRZ operation is not supported using the "adjoint" differentiation method"

Additional information

Previously, we were using PennyLane version 0.19 with the "default.qubit" and "qulacs.simulator" devices. The latter also has GPU support.

With the new version 0.22, I also noticed that the simulation is significantly slower than with the previous PennyLane version.

For this GitHub issue, I simply took the example problem from the QAOA tutorial to reproduce my problem. (https://pennylane.ai/qml/demos/tutorial_qaoa_intro.html)

Source code

import pennylane as qml
from pennylane import qaoa
from pennylane import numpy as np
from matplotlib import pyplot as plt
import networkx as nx

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
graph = nx.Graph(edges)

cost_h, mixer_h = qaoa.min_vertex_cover(graph, constrained=False)

def qaoa_layer(gamma, alpha):
    qaoa.cost_layer(gamma, cost_h)
    qaoa.mixer_layer(alpha, mixer_h)

wires = 4
depth = 2

def circuit(params, **kwargs):
    for w in range(wires):
        qml.Hadamard(wires=w)
    qml.layer(qaoa_layer, depth, params[0], params[1])

dev = qml.device("lightning.gpu", wires=range(wires))

cost_function = qml.ExpvalCost(circuit, cost_h, dev, optimize=True, diff_method="adjoint")

optimizer = qml.GradientDescentOptimizer()
steps = 70
params = np.array([[0.5, 0.5], [0.5, 0.5]], requires_grad=True)

for i in range(steps):
    params, cost_before = optimizer.step_and_cost(cost_function, params)
    print(f"Cost at step {i}: {cost_before}")

print("Optimal Parameters")
print(params)

@qml.qnode(dev)
def probability_circuit(gamma, alpha):
    circuit([gamma, alpha])
    return qml.probs(wires=range(wires))

probs_raw = probability_circuit(params[0], params[1])
indx = np.ndindex(*[2] * wires)
probs = {p: probs_raw[i] for i, p in enumerate(indx)}
best_bitstring = max(probs, key=probs.get)

print(f"Best bitstring: {best_bitstring} with prob: {probs[best_bitstring]}")

Tracebacks

Traceback (most recent call last):
  File "/home/q541472/dev/test/qaoa.py", line 33, in <module>
    params, cost_before = optimizer.step_and_cost(cost_function, params)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/optimize/gradient_descent.py", line 100, in step_and_cost
    g, forward = self.compute_grad(objective_fn, args, kwargs, grad_fn=grad_fn)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/optimize/gradient_descent.py", line 158, in compute_grad
    grad = g(*args, **kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/_grad.py", line 113, in __call__
    grad_value, ans = grad_fn(*args, **kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/autograd/wrap_util.py", line 20, in nary_f
    return unary_operator(unary_f, x, *nary_op_args, **nary_op_kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/_grad.py", line 131, in _grad_with_forward
    vjp, ans = _make_vjp(fun, x)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/autograd/core.py", line 10, in make_vjp
    end_value, end_node =  trace(start_node, fun, x)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/autograd/tracer.py", line 10, in trace
    end_box = fun(start_box)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/autograd/wrap_util.py", line 15, in unary_f
    return fun(*subargs, **kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/vqe/vqe.py", line 206, in __call__
    return self.cost_fn(*args, **kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/vqe/vqe.py", line 196, in cost_fn
    res = circuit(*qnode_args, obs=o, **qnode_kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/qnode.py", line 578, in __call__
    res = qml.execute(
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/interfaces/batch/__init__.py", line 412, in execute
    res = _execute(
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/interfaces/batch/autograd.py", line 64, in execute
    return _execute(
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/autograd/tracer.py", line 44, in f_wrapped
    ans = f_wrapped(*argvals, **kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/autograd/tracer.py", line 48, in f_wrapped
    return f_raw(*args, **kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/interfaces/batch/autograd.py", line 108, in _execute
    res, jacs = execute_fn(tapes, **gradient_kwargs)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane/_device.py", line 537, in execute_and_gradients
    jacs.append(gradient_method(circuit, **kwargs))
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 271, in adjoint_jacobian
    self.adjoint_diff_support_check(tape)
  File "/home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 245, in adjoint_diff_support_check
    raise QuantumFunctionError(
pennylane.QuantumFunctionError: The MultiRZ operation is not supported using the "adjoint" differentiation method

System information

Name: PennyLane
Version: 0.22.1
Summary: PennyLane is a Python quantum machine learning library by Xanadu Inc.
Home-page: https://github.com/XanaduAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /home/q541472/anaconda3/envs/quark_test/lib/python3.9/site-packages
Requires: pennylane-lightning, autograd, semantic-version, scipy, networkx, cachetools, toml, autoray, retworkx, numpy, appdirs
Required-by: pennylane-qulacs, PennyLane-Lightning, PennyLane-Lightning-GPU, amazon-braket-pennylane-plugin
Platform info:           Linux-5.4.0-100-generic-x86_64-with-glibc2.31
Python version:          3.9.7
Numpy version:           1.21.2
Scipy version:           1.7.3
Installed devices:
- braket.aws.qubit (amazon-braket-pennylane-plugin-1.5.2)
- braket.local.qubit (amazon-braket-pennylane-plugin-1.5.2)
- default.gaussian (PennyLane-0.22.1)
- default.mixed (PennyLane-0.22.1)
- default.qubit (PennyLane-0.22.1)
- default.qubit.autograd (PennyLane-0.22.1)
- default.qubit.jax (PennyLane-0.22.1)
- default.qubit.tf (PennyLane-0.22.1)
- default.qubit.torch (PennyLane-0.22.1)
- lightning.gpu (PennyLane-Lightning-GPU-0.22.1)
- lightning.qubit (PennyLane-Lightning-0.22.1)
- qulacs.simulator (pennylane-qulacs-0.16.0)

Existing GitHub issues

[X] I have searched existing GitHub issues to make sure the issue does not already exist.

josh146 commented 2 years ago

Hey @leolettuce! Thanks for alerting us to this.

While we dig into the problem here, I am wondering if forcing a decomposition for the MultiRZ gate could provide a workaround in the meantime?

As a small example,

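import pennylane as qml
from pennylane import numpy as np

# Force MultiRZ to be decomposed into gates that adjoint differentiation supports
custom_decomps = {'MultiRZ': qml.MultiRZ.compute_decomposition}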
dev = qml.device('lightning.gpu', wires=2, custom_decomps=custom_decomps)

@qml.qnode(dev, diff_method='adjoint')
def cost(theta):
    qml.Hadamard(wires=0)
    qml.Hadamard(wires=1)
    qml.MultiRZ(theta, wires=[1, 0])
    return qml.expval(qml.PauliX(1))

x = np.array(0.5, requires_grad=True)
cost(x)
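
To see why a decomposition can stand in for MultiRZ, here is a minimal NumPy sketch of the two-qubit case. It checks the textbook identity that a CNOT, an RZ on the target qubit, and a second CNOT together equal exp(-i * theta / 2 * Z⊗Z). The helper names (rz, multirz_2q, decomposed_multirz_2q) are made up for this illustration, and the exact gate ordering that compute_decomposition emits may differ from the one shown.

```python
import numpy as np


def rz(theta):
    """Single-qubit RZ(theta) = exp(-i * theta / 2 * Z)."""
    return np.diag([np.exp(-0.5j * theta), np.exp(0.5j * theta)])


# CNOT with qubit 0 as control and qubit 1 as target
# (basis order |00>, |01>, |10>, |11>).
CNOT = np.array(
    [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]],
    dtype=complex,
)


def multirz_2q(theta):
    """Two-qubit MultiRZ(theta) = exp(-i * theta / 2 * Z(x)Z) as a diagonal matrix."""
    zz_eigenvalues = np.array([1, -1, -1, 1])
    return np.diag(np.exp(-0.5j * theta * zz_eigenvalues))


def decomposed_multirz_2q(theta):
    """CNOT, then RZ on the target qubit, then CNOT again."""
    return CNOT @ np.kron(np.eye(2), rz(theta)) @ CNOT


theta = 0.5
decomposition_matches = np.allclose(multirz_2q(theta), decomposed_multirz_2q(theta))
print(decomposition_matches)
```

Since the decomposed circuit contains only CNOT and RZ gates, the adjoint support check no longer encounters a MultiRZ operation.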

leolettuce commented 2 years ago

Thank you! Yes, decomposing the MultiRZ gate resolves the error.

Nonetheless, I still have two open questions:

1. I want to run QAOA for a problem using 25 qubits. As stated here: https://discuss.pennylane.ai/t/which-device-is-fastest/1774/3, the lightning.gpu device can be faster for simulations with more than 20 qubits. I tried to run the problem with lightning.gpu and the adjoint differentiation method. Unfortunately, I get the following error message:

[/pennylane-lightning-gpu/pennylane_lightning_gpu/src/simulator/StateVectorCudaBase.hpp][Line :246][Method:StateVectorCudaBase]: Error in PennyLane Lightning: out of memory

Is there a simple way to reduce the memory usage?

2. I am still wondering why the qulacs.simulator device behaves differently in different PennyLane versions. I ran the small toy problem above with a circuit depth of 20 and recorded the time one iteration of QAOA takes. I used different simulators on several PennyLane versions and got the following results:

Pennylane v0.22

lightning.gpu with diff_method="adjoint": 3.4 seconds
lightning.gpu with diff_method="best": 19.6 seconds
qulacs.simulator without gpu: 21.5 seconds
qulacs.simulator with gpu: 22.7 seconds
default.qubit: 0.2 seconds

Pennylane v0.21

qulacs.simulator without gpu: 31.9 seconds
qulacs.simulator with gpu: 32.3 seconds
default.qubit: 0.2 seconds

Pennylane v0.20

qulacs.simulator without gpu: 44 seconds
qulacs.simulator with gpu: 33 seconds
default.qubit: 0.2 seconds

Pennylane v0.19

qulacs.simulator without gpu: 7.4 seconds
qulacs.simulator with gpu: 9.2 seconds
default.qubit: 0.2 seconds

I would have expected the qulacs.simulator device to yield similar results on all PennyLane versions. I also experienced this behavior on larger problems. Is there a known explanation for that?
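
For anyone who wants to reproduce per-iteration timings like these, a harness along the following lines should be enough. This is only a sketch: time_per_step and dummy_step are hypothetical names, and dummy_step is a placeholder workload. In the script above, the callable would instead wrap a single optimizer.step_and_cost(cost_function, params) call.

```python
import time


def time_per_step(step_fn, n_steps=3):
    """Return the average wall-clock time in seconds of one call to step_fn."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps


def dummy_step():
    # Placeholder workload standing in for one QAOA optimization step.
    sum(i * i for i in range(10_000))


avg_seconds = time_per_step(dummy_step)
print(f"{avg_seconds:.6f} s per step")
```

Averaging over a few steps smooths out one-off costs such as device initialization in the first iteration.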

mlxd commented 2 years ago

Hi @leolettuce, thanks for the update. The performance differences you've shown are likely due to a similar cause as https://github.com/PennyLaneAI/pennylane/issues/2430#issuecomment-1092880807. Namely, to ensure better usage of quantum resources, we make use of more up-front classical processing in PennyLane v0.20 and above, as this allows us to support n-th order gradients relatively easily. The Qulacs device also relies on the parameter-shift method for gradients, which can have a high cost for large circuits with several parameters. We are currently addressing some of the additional classical overhead that came with improving quantum resource usage.
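
To illustrate why parameter-shift gets expensive, here is a small pure-Python sketch of the two-term rule on a case with a known answer: the expectation value of Z after RX(theta) on |0>, which is cos(theta). The function names are made up for this example, and the count of 40 gate parameters is purely illustrative, not the actual number in your depth-20 circuit.

```python
import math


def expval_z_after_rx(theta):
    """Analytic <Z> for the state RX(theta)|0>, which equals cos(theta)."""
    return math.cos(theta)


def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Two-term parameter-shift rule for a gate generated by a Pauli operator."""
    return (f(theta + shift) - f(theta - shift)) / 2


def shifted_executions(n_gate_params):
    """Circuit executions needed for one gradient with the two-term rule."""
    return 2 * n_gate_params


theta = 0.5
grad = parameter_shift_grad(expval_z_after_rx, theta)
print(grad, -math.sin(theta))   # both values agree
print(shifted_executions(40))   # 40 is an illustrative parameter count
```

Each gate parameter needs two extra circuit executions per gradient, so the cost grows linearly with circuit depth, whereas the adjoint method obtains all gradients from a single forward pass plus a backward sweep over the gates.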

As for the question about memory usage, this is also a challenging one. Whether large problems run on lightning.gpu depends on a number of factors, one of which is the available RAM on the given GPU. For V100s, whether this is 16GB or 32GB can make a big difference to whether the circuit runs. Similarly, for A100s, having the 40GB or the 80GB version can matter, depending on the problem at hand. Due to how memory is allocated by intermediate library calls, it can be difficult to predict up-front whether a problem will fit. Since you have access to a DGX box, you can always use lightning.qubit with diff_method="adjoint", and control the number of concurrent expectation value calculations with OMP_NUM_THREADS, as mentioned at the bottom of the page here.

Feel free to provide a minimal working example of QAOA, though, if you would like us to investigate this further; there may be some optimizations we can provide based on your workload needs.
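
As a rough lower bound, the size of a single dense state vector can be estimated as below. This sketch assumes double-precision complex amplitudes (16 bytes each), and the helper names are hypothetical. It deliberately ignores any extra state copies or workspace buffers that the adjoint method and intermediate library calls allocate, which is exactly the part that is hard to predict.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Bytes needed for one dense state vector (16 bytes per complex128 amplitude)."""
    return (2 ** n_qubits) * bytes_per_amplitude


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2 ** 30


for n_qubits in (20, 25, 30):
    size_gib = to_gib(statevector_bytes(n_qubits))
    print(f"{n_qubits} qubits: {size_gib} GiB per state vector")
```

A single 25-qubit state vector is only 0.5 GiB under these assumptions, so an out-of-memory error at that size points to the additional allocations rather than to the state vector itself.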
