PennyLaneAI / pennylane-lightning

The PennyLane-Lightning plugin provides a fast state-vector simulator written in C++ for use with PennyLane.
https://docs.pennylane.ai/projects/lightning
Apache License 2.0

Using docker pennylane-lightning-gpu on GH200 #742

Closed · giladqm closed this 3 months ago

giladqm commented 3 months ago

Hello, I want to test my GH200 (Grace Hopper by Nvidia) by executing a simulation of a "heavy-weight" quantum algorithm with multiple qubits, and I thought PennyLane-Lightning could be a great tool for this. I want to run the simulation via Docker, but I saw you have archived the pennylane-lightning-gpu repo. So what do you recommend I do?

mlxd commented 3 months ago

Hi @giladqm

The repo https://github.com/PennyLaneAI/pennylane-lightning-gpu and all of its contents were migrated to https://github.com/PennyLaneAI/pennylane-lightning, so all of the lightning.gpu device components can now be installed from that repository. For aarch64 systems we do not currently provide wheels through PyPI, but there are multiple options for installing the package (see https://pennylane.ai/install/#high-performance-computing-and-gpus for more details).

Feel free to let us know if the above doesn't work. Shipping PyPI wheels for aarch64 is on our roadmap, but we have no timeline to share yet.
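For reference, one way to attempt a source install directly from the repository is a single pip command (a sketch, assuming the PL_BACKEND environment variable reaches the build and that the CUDA toolkit and cuQuantum libraries are already on the machine; not an officially documented one-liner):

PL_BACKEND="lightning_gpu" python -m pip install git+https://github.com/PennyLaneAI/pennylane-lightning.git@latest_release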

giladqm commented 3 months ago

This is what I get:

(pyenv) gilad@gracehopper:~/pennylane-lightning$ PL_BACKEND="lightning_gpu" python -m pip install . --verbose
Using pip 22.0.2 from /home/gilad/pyenv/lib/python3.10/site-packages/pip (python 3.10)
Processing /home/gilad/pennylane-lightning
  Running command python setup.py egg_info
  running egg_info
  creating /tmp/pip-pip-egg-info-f0k0xpmr/PennyLane_Lightning_GPU.egg-info
  [... egg_info writing/reading steps trimmed ...]
  warning: no files found matching 'pennylane_lightning/lightning_qpu/lightning_gpu.toml'
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-pip-egg-info-f0k0xpmr/PennyLane_Lightning_GPU.egg-info/SOURCES.txt'
Preparing metadata (setup.py) ... done
Requirement already satisfied: pennylane>=0.34 in /home/gilad/pyenv/lib/python3.10/site-packages (from PennyLane-Lightning-GPU==0.36.0) (0.36.0)
Requirement already satisfied: pennylane_lightning==0.36.0 in /home/gilad/pyenv/lib/python3.10/site-packages (from PennyLane-Lightning-GPU==0.36.0) (0.36.0)
[... "Requirement already satisfied" lines for numpy, rustworkx, semantic-version, typing-extensions, autoray, networkx, scipy, autograd, requests, toml, appdirs, cachetools, future, idna, certifi, urllib3 and charset-normalizer trimmed ...]
Using legacy 'setup.py install' for PennyLane-Lightning-GPU, since package 'wheel' is not installed.
Installing collected packages: PennyLane-Lightning-GPU
  Running command Running setup.py install for PennyLane-Lightning-GPU
  running install
  /home/gilad/pyenv/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
    warnings.warn(
  running build
  running build_py
  running egg_info
  [... egg_info steps repeated, trimmed ...]
  running build_ext
  [LIGHTNING ASCII-art banner]
  -- pennylane_lightning version 0.36.0
  -- Found OpenMP_CXX: -fopenmp (found version "4.5")
  -- PL_BACKEND: lightning_gpu
  -- ENABLE_WARNINGS is OFF.
  -- ENABLE_OPENMP is ON.
  -- Found OpenMP_CXX: -fopenmp (found version "4.5")
  -- Python scipy-lib path: /home/gilad/pyenv/lib/python3.10/site-packages/scipy.libs
  -- pybind11 v2.11.1
  Python site-packages directory: /home/gilad/pyenv/lib/python3.10/site-packages
  [LIGHTNING GPU ASCII-art banner]

  CMake Error at /home/gilad/pyenv/lib/python3.10/site-packages/cmake/data/share/cmake-3.29/Modules/Internal/CMakeCUDAArchitecturesValidate.cmake:7 (message):
    CMAKE_CUDA_ARCHITECTURES must be non-empty if set.
  Call Stack (most recent call first):
    /home/gilad/pyenv/lib/python3.10/site-packages/cmake/data/share/cmake-3.29/Modules/CMakeDetermineCUDACompiler.cmake:112 (cmake_cuda_architectures_validate)
    pennylane_lightning/core/src/simulators/lightning_gpu/CMakeLists.txt:9 (project)

  -- Configuring incomplete, errors occurred!
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/home/gilad/pennylane-lightning/setup.py", line 250, in <module>
    [... intermediate setuptools/distutils build frames trimmed ...]
    File "/home/gilad/pennylane-lightning/setup.py", line 158, in build_extension
      subprocess.check_call(
    File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '/home/gilad/pennylane-lightning', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/home/gilad/pennylane-lightning/build/lib.linux-aarch64-3.10/pennylane_lightning', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DENABLE_WARNINGS=OFF', '-DPYTHON_EXECUTABLE=/home/gilad/pyenv/bin/python', '-GNinja', '-DCMAKE_MAKE_PROGRAM=/home/gilad/pyenv/bin/ninja', '-DPL_BACKEND=lightning_gpu']' returned non-zero exit status 1.
  error: subprocess-exited-with-error

  × Running setup.py install for PennyLane-Lightning-GPU did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /home/gilad/pyenv/bin/python -u -c '[pip's generic <pip-setuptools-caller> shim, which imports setuptools and exec's setup.py; its source is trimmed here]' install --record /tmp/pip-record-6lmxypxj/install-record.txt --single-version-externally-managed --compile --install-headers /home/gilad/pyenv/include/site/python3.10/PennyLane-Lightning-GPU
cwd: /home/gilad/pennylane-lightning/
Running setup.py install for PennyLane-Lightning-GPU ... error
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> PennyLane-Lightning-GPU

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
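The CMake message above fires when CMAKE_CUDA_ARCHITECTURES is set but empty while CMake probes the CUDA compiler. One possible workaround, assuming pennylane-lightning's setup.py forwards the CMAKE_ARGS environment variable to CMake, is to pass an explicit architecture for the GH200's Hopper GPU (SM 9.0); this is a sketch, not a fix verified on this machine:

CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=90" PL_BACKEND="lightning_gpu" python -m pip install . --verbose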

mlxd commented 3 months ago

Hi @giladqm

Are you running this on the Nvidia cuQuantum appliance Docker image? I verified this by using docker run --platform aarch64 --rm -it nvcr.io/nvidia/cuquantum-appliance:24.03-arm64 to spin up the container locally, and then running exactly:

python -m venv pyenv && source ./pyenv/bin/activate
python -m pip install pennylane
git clone https://github.com/PennyLaneAI/pennylane-lightning --branch latest_release --single-branch
cd pennylane-lightning

# requirements-dev.txt does not have wheels for all packages, so we explicitly list these out
python -m pip install cmake ninja custatevec_cu12 pip~=22.0 
PL_BACKEND="lightning_gpu" python -m pip install . --verbose

and the installation completes successfully. Have you made any custom modifications to your environment, or are you working on a different container image than nvcr.io/nvidia/cuquantum-appliance:24.03-arm64?

If you are using a different env, there may be missing packages; in this instance, it looks like setuptools isn't available in your environment. I'd recommend installing that package, since it is likely the cause of the failure in your env. Let us know if this helps.
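For example, inside the active virtualenv (the log above also notes that the wheel package is missing, so it is worth adding at the same time):

python -m pip install --upgrade setuptools wheel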

giladqm commented 3 months ago

I followed your instructions and indeed the installation was successful. Unfortunately, it looks like the GPU isn't being used. This is the code I'm running:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Specify the index of the GPU you want to use

import time
import logging
import torch
import pennylane as qml
from matplotlib import pyplot as plt
from pennylane import numpy as np

# Define the directory to save outputs
output_dir = "output"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Set up logging
log_file_path = os.path.join(output_dir, "output_log.txt")
logging.basicConfig(filename=log_file_path, level=logging.INFO, 
                    format='%(asctime)s %(message)s', filemode='w')
console = logging.StreamHandler()
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s %(message)s')
console.setFormatter(formatter)
logging.getLogger('').addHandler(console)

def circuit0_basic(params, wires):
    n_qubits = len(wires)
    n_rotations = len(params)
    if n_rotations > 1:
        n_layers = n_rotations // n_qubits
        n_extra_rots = n_rotations - n_layers * n_qubits

        for layer_idx in range(n_layers):
            layer_params = params[layer_idx * n_qubits: layer_idx * n_qubits + n_qubits, :]
            qml.broadcast(qml.Rot, wires, pattern="single", parameters=layer_params)
            qml.broadcast(qml.CNOT, wires, pattern="ring")

        extra_params = params[-n_extra_rots:, :]
        extra_wires = wires[: n_qubits - 1 - n_extra_rots: -1]
        qml.broadcast(qml.Rot, extra_wires, pattern="single", parameters=extra_params)
    else:
        qml.Rot(*params[0], wires=wires[0])

def circuit0(params, wires):
    n_qubits = len(wires)
    n_rotations = len(params)

    if n_rotations > 1:
        n_layers = n_rotations // n_qubits
        n_extra_rots = n_rotations - n_layers * n_qubits

        for layer_idx in range(n_layers):
            layer_params = params[layer_idx * n_qubits: layer_idx * n_qubits + n_qubits, :]
            qml.broadcast(qml.Rot, wires, pattern="single", parameters=layer_params)
            qml.broadcast(qml.CNOT, wires, pattern="ring")

        extra_params = params[-n_extra_rots:, :]
        extra_wires = wires[: n_qubits - 1 - n_extra_rots: -1]
        qml.broadcast(qml.Rot, extra_wires, pattern="single", parameters=extra_params)
    else:
        qml.Rot(*params[0], wires=wires[0])

def runtest(H, cfg, show=False):
    max_iterations = cfg.max_iterations
    num_qubits = len(H.wires)
    num_param_sets = (2 ** num_qubits) - 1

    params = np.random.uniform(low=-np.pi / 2, high=np.pi / 2, size=(num_param_sets, 3))
    params = np.array(params, requires_grad=True)

    dev = qml.device("lightning.gpu", wires=num_qubits, batch_obs=1)
    logging.info(f"Using device: {dev}")

    @qml.qnode(dev, interface='autograd')
    def cost_fn(params):
        circuit0_basic(params, wires=H.wires)
        return qml.expval(H)

    @qml.qnode(dev, interface='autograd')
    def state_fn(params):
        circuit0_basic(params, wires=H.wires)
        return qml.state()

    # Check if GPU is available and log it after device is set
    if torch.cuda.is_available():
        logging.info("CUDA is available. Using GPU.")
    else:
        logging.info("CUDA is not available. Using CPU.")
        logging.info("CUDA check details:")
        logging.info(f"CUDA available: {torch.cuda.is_available()}")
        logging.info(f"CUDA device count: {torch.cuda.device_count()}")

    opt = qml.AdamOptimizer(stepsize=0.1)
    conv_tol = 1e-06
    energy_plot = []
    prev_energy = cost_fn(params)

    for n in range(max_iterations):
        params, energy = opt.step_and_cost(cost_fn, params)
        logging.info(f"Iteration {n + 1}/{max_iterations}: Energy = {energy:.6f}")
        energy_plot.append(energy)

        if np.abs(energy - prev_energy) <= conv_tol:
            logging.info("Convergence reached!")
            break
        prev_energy = energy

        # Print progress
        progress = (n + 1) / max_iterations * 100
        logging.info(f"Progress: {progress:.2f}%")

    logging.info("Optimization completed.")

    # Using the params, find the ground state vector
    best_params = params
    ground_state = state_fn(best_params)

    # Plot energies
    plt.clf()
    plt.plot(energy_plot)
    plt.xlabel("Iterations")
    plt.ylabel("Energy")
    plt.title("Energy at each iteration")
    energy_plot_path = os.path.join(output_dir, "energy_plot.png")
    plt.savefig(energy_plot_path)
    if show:
        plt.show()

    return energy, ground_state

class Params0:
    pass

class Hamiltonian:

    def __init__(self, N):
        self.ham = qml.Hamiltonian([], [])
        self.energies = None
        self.states = None
        self.gs_energy = None
        self.gs_state = None

        self.N = N  # Number of qubits

class MyHamiltonian0(Hamiltonian):

    def __init__(self, N, A, b, P=1, flag0=False):
        n = N // P
        super().__init__(N)
        self.flag0 = flag0
        self.P = P
        self.n = n
        self.A = A
        self.b = b
        self.y = self.A @ self.b.reshape(-1, 1)
        self.y = np.matrix(self.y)
        self.set_hamiltonian()

    def get_evec(self):
        return self.noise_std * np.random.normal(size=self.n, requires_grad=False)

    def set_hamiltonian(self):

        def func1():
            param1, param2 = self.A.shape
            w1 = np.zeros((param2, param2))
            w2 = np.zeros(param2)
            for m in range(param1):
                for i in range(param2):
                    w2[i] += -2 * self.A[m, i] * self.y[m, 0]
                    for j in range(param2):
                        w1[i, j] += self.A[m, i] * self.A[m, j]
            return w1, w2

        def func2():
            w1, w2 = func1()
            min_val = -2 ** (self.P - 1) + 1
            param1, param2 = w1.shape
            v1 = np.zeros((param2 * self.P, param2 * self.P))
            v2 = np.zeros(param2 * self.P)
            for i in range(param1):
                for s in range(self.P):
                    v2[self.P * i + s] += (2 ** s) * w2[i]
                    for j in range(param2):
                        v2[self.P * i + s] += (2 ** s) * 2 * min_val * w1[i, j]
                        for p in range(self.P):
                            v1[self.P * i + s, self.P * j + p] += (2 ** (s + p)) * w1[i, j]
            return v1, v2

        v1, v2 = func2()

        H = qml.Hamiltonian([], [])
        for i in range(self.n):
            for s in range(self.P):
                xadded = False
                x = i * self.P + s
                fact = - (sum(v1[x, :]) + sum(v1[:, x]) + 2*v2[x])
                if fact != 0:
                    xadded = True
                    H += fact * qml.PauliZ(x)
                for j in range(self.n):
                    for p in range(self.P):
                        y = j*self.P + p
                        fact = v1[x, y]
                        if fact != 0:
                            xadded = True
                            H += fact * qml.PauliZ(x) @ qml.PauliZ(y)
                if not xadded:
                    H += 0.0 * qml.PauliZ(x)

            if self.flag0:
                x = i * self.P
                H += -(1/2)*qml.PauliZ(x) + (1/4)*qml.PauliZ(x) @ qml.PauliZ(x)

        self.ham = H

def main():
    tstart = time.time()

    logging.info('Running experiment0')
    cfg = Params0()
    cfg.ensemble_num = 10
    P = 1
    n = 21
    cfg.num_of_qubits = n * P
    cfg.max_iterations = 10

    cfg.experiment_name = 'experiment0'
    cfg.hamiltonian_type = 'Hamiltonian0'

    A = np.random.randn(n, n)
    b = np.random.randint(2 ** P, size=n, requires_grad=False)
    flag0 = False

    res_vec = []
    H = MyHamiltonian0(cfg.num_of_qubits, A=A, b=b, P=P, flag0=flag0)
    logging.info("Starting the optimization...")
    gs_energy, __ = runtest(H.ham, cfg, show=False)
    res_vec.append(gs_energy)
    logging.info("Optimization finished.")
    logging.info("TIME= %d [sec]", int(np.round((time.time() - tstart))))
    logging.info("Ground state energy: %s", res_vec)
    return
    # -------------------------------------

if __name__ == '__main__':
    main()

and the output:

(pyenv) (base) cuquantum@7806514dd020:~$ python Gilad_Test.py 
2024-05-28 05:48:21,924 Running experiment0
2024-05-28 05:48:26,198 Starting the optimization...
2024-05-28 05:48:26,788 Using device: Lightning GPU PennyLane plugin
Short name: lightning.gpu
Package: pennylane_lightning
Plugin version: 0.36.0
Author: Xanadu Inc.
Wires: 21
Shots: None
2024-05-28 05:48:26,789 CUDA is not available. Using CPU.
2024-05-28 05:48:26,790 CUDA check details:
2024-05-28 05:48:26,790 CUDA available: False
2024-05-28 05:48:26,790 CUDA device count: 0
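One thing worth noting here: torch.cuda.is_available() only reports whether the installed PyTorch build can see a GPU; lightning.gpu executes through cuQuantum's cuStateVec and does not go through PyTorch at all. A more direct smoke test of the simulator itself might be (a minimal sketch):

import pennylane as qml

# if this constructs and executes, the cuStateVec backend found a usable GPU
dev = qml.device("lightning.gpu", wires=2)

@qml.qnode(dev)
def probe():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

print(probe())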

giladqm commented 3 months ago

I built torch from source and now I get:

(cuquantum-24.03) cuquantum@7806514dd020:~$ python Gilad_Test.py 
2024-05-28 07:11:14,614 Running experiment0
2024-05-28 07:11:18,819 Starting the optimization...
2024-05-28 07:11:19,331 Using device: Lightning GPU PennyLane plugin
Short name: lightning.gpu
Package: pennylane_lightning
Plugin version: 0.36.0
Author: Xanadu Inc.
Wires: 21
Shots: None
2024-05-28 07:11:19,332 CUDA is available. Using GPU.

But it doesn't seem like it's really working...

giladqm commented 3 months ago

I also tried updating the code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Specify the index of the GPU you want to use

import time
import logging
import torch
import pennylane as qml
from matplotlib import pyplot as plt
from pennylane import numpy as np

# Define the directory to save outputs
output_dir = "output"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Set up logging
log_file_path = os.path.join(output_dir, "output_log.txt")
logging.basicConfig(filename=log_file_path, level=logging.INFO, 
                    format='%(asctime)s %(message)s', filemode='w')
console = logging.StreamHandler()
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s %(message)s')
console.setFormatter(formatter)
logging.getLogger('').addHandler(console)

def circuit0_basic(params, wires):
    n_qubits = len(wires)
    n_rotations = len(params)
    if n_rotations > 1:
        n_layers = n_rotations // n_qubits
        n_extra_rots = n_rotations - n_layers * n_qubits

        for layer_idx in range(n_layers):
            layer_params = params[layer_idx * n_qubits: layer_idx * n_qubits + n_qubits, :]
            qml.broadcast(qml.Rot, wires, pattern="single", parameters=layer_params)
            qml.broadcast(qml.CNOT, wires, pattern="ring")

        extra_params = params[-n_extra_rots:, :]
        extra_wires = wires[: n_qubits - 1 - n_extra_rots: -1]
        qml.broadcast(qml.Rot, extra_wires, pattern="single", parameters=extra_params)
    else:
        qml.Rot(*params[0], wires=wires[0])

def circuit0(params, wires):
    n_qubits = len(wires)
    n_rotations = len(params)

    if n_rotations > 1:
        n_layers = n_rotations // n_qubits
        n_extra_rots = n_rotations - n_layers * n_qubits

        for layer_idx in range(n_layers):
            layer_params = params[layer_idx * n_qubits: layer_idx * n_qubits + n_qubits, :]
            qml.broadcast(qml.Rot, wires, pattern="single", parameters=layer_params)
            qml.broadcast(qml.CNOT, wires, pattern="ring")

        extra_params = params[-n_extra_rots:, :]
        extra_wires = wires[: n_qubits - 1 - n_extra_rots: -1]
        qml.broadcast(qml.Rot, extra_wires, pattern="single", parameters=extra_params)
    else:
        qml.Rot(*params[0], wires=wires[0])

def runtest(H, cfg, show=False):
    max_iterations = cfg.max_iterations
    num_qubits = len(H.wires)
    num_param_sets = (2 ** num_qubits) - 1

    # Initialize parameters directly on the GPU
    params = torch.tensor(np.random.uniform(low=-np.pi / 2, high=np.pi / 2, size=(num_param_sets, 3)), requires_grad=True, device='cuda', dtype=torch.float32)

    dev = qml.device("lightning.gpu", wires=num_qubits, batch_obs=True)
    logging.info(f"Using device: {dev}")

    @qml.qnode(dev, interface='torch')
    def cost_fn(params):
        circuit0_basic(params, wires=H.wires)
        return qml.expval(H)

    @qml.qnode(dev, interface='torch')
    def state_fn(params):
        circuit0_basic(params, wires=H.wires)
        return qml.state()

    # Check if GPU is available and log it after device is set
    if torch.cuda.is_available():
        logging.info("CUDA is available. Using GPU.")
    else:
        logging.info("CUDA is not available. Using CPU.")
        logging.info("CUDA check details:")
        logging.info(f"CUDA available: {torch.cuda.is_available()}")
        logging.info(f"CUDA device count: {torch.cuda.device_count()}")

    opt = torch.optim.Adam([params], lr=0.1)
    conv_tol = 1e-06
    energy_plot = []
    prev_energy = cost_fn(params).item()

    for n in range(max_iterations):
        opt.zero_grad()
        energy = cost_fn(params)
        energy.backward()
        opt.step()
        energy = energy.item()
        logging.info(f"Iteration {n + 1}/{max_iterations}: Energy = {energy:.6f}")
        energy_plot.append(energy)

        if np.abs(energy - prev_energy) <= conv_tol:
            logging.info("Convergence reached!")
            break
        prev_energy = energy

        # Print progress
        progress = (n + 1) / max_iterations * 100
        logging.info(f"Progress: {progress:.2f}%")

    logging.info("Optimization completed.")

    # Using the params, find the ground state vector
    best_params = params
    ground_state = state_fn(best_params)

    # Plot energies
    plt.clf()
    plt.plot(energy_plot)
    plt.xlabel("Iterations")
    plt.ylabel("Energy")
    plt.title("Energy at each iteration")
    energy_plot_path = os.path.join(output_dir, "energy_plot.png")
    plt.savefig(energy_plot_path)
    if show:
        plt.show()

    return energy, ground_state

class Params0:
    pass

class Hamiltonian:

    def __init__(self, N):
        self.ham = qml.Hamiltonian([], [])
        self.energies = None
        self.states = None
        self.gs_energy = None
        self.gs_state = None

        self.N = N  # Number of qubits

class MyHamiltonian0(Hamiltonian):

    def __init__(self, N, A, b, P=1, flag0=False):
        n = N // P
        super().__init__(N)
        self.flag0 = flag0
        self.P = P
        self.n = n
        self.A = A.to(torch.float32).to('cuda')  # Move to GPU and ensure float32 dtype
        self.b = b.to(torch.float32).to('cuda')  # Move to GPU and ensure float32 dtype
        self.y = self.A @ self.b.reshape(-1, 1)
        self.y = self.y.clone().detach().requires_grad_(True)  # Properly construct tensor from existing tensor
        self.set_hamiltonian()

    def get_evec(self):
        return self.noise_std * torch.normal(mean=0, std=1, size=(self.n,), device='cuda', requires_grad=False)

    def set_hamiltonian(self):

        def func1():
            param1, param2 = self.A.shape
            w1 = torch.zeros((param2, param2), device='cuda', dtype=torch.float32)
            w2 = torch.zeros(param2, device='cuda', dtype=torch.float32)
            for m in range(param1):
                for i in range(param2):
                    w2[i] += -2 * self.A[m, i] * self.y[m, 0]
                    for j in range(param2):
                        w1[i, j] += self.A[m, i] * self.A[m, j]
            return w1, w2

        def func2():
            w1, w2 = func1()
            min_val = -2 ** (self.P - 1) + 1
            param1, param2 = w1.shape
            v1 = torch.zeros((param2 * self.P, param2 * self.P), device='cuda', dtype=torch.float32)
            v2 = torch.zeros(param2 * self.P, device='cuda', dtype=torch.float32)
            for i in range(param1):
                for s in range(self.P):
                    v2[self.P * i + s] += (2 ** s) * w2[i]
                    for j in range(param2):
                        v2[self.P * i + s] += (2 ** s) * 2 * min_val * w1[i, j]
                        for p in range(self.P):
                            v1[self.P * i + s, self.P * j + p] += (2 ** (s + p)) * w1[i, j]
            return v1, v2

        v1, v2 = func2()

        H = qml.Hamiltonian([], [])
        for i in range(self.n):
            for s in range(self.P):
                xadded = False
                x = i * self.P + s
                fact = - (sum(v1[x, :]) + sum(v1[:, x]) + 2*v2[x])
                if fact != 0:
                    xadded = True
                    H += fact * qml.PauliZ(x)
                for j in range(self.n):
                    for p in range(self.P):
                        y = j*self.P + p
                        fact = v1[x, y]
                        if fact != 0:
                            xadded = True
                            H += fact * qml.PauliZ(x) @ qml.PauliZ(y)
                if not xadded:
                    H += 0.0 * qml.PauliZ(x)

            if self.flag0:
                x = i * self.P
                H += -(1/2)*qml.PauliZ(x) + (1/4)*qml.PauliZ(x) @ qml.PauliZ(x)

        self.ham = H

def main():
    tstart = time.time()

    logging.info('Running experiment0')
    cfg = Params0()
    cfg.ensemble_num = 10
    P = 1
    n = 21  # Increase number of qubits
    cfg.num_of_qubits = n * P
    cfg.max_iterations = 100  # Increase the number of iterations for better GPU utilization

    cfg.experiment_name = 'experiment0'
    cfg.hamiltonian_type = 'Hamiltonian0'

    A = torch.randn(n, n, device='cuda', dtype=torch.float32)  # Move to GPU and ensure float32 dtype
    b = torch.randint(2 ** P, size=(n,), device='cuda', dtype=torch.float32, requires_grad=False)  # Move to GPU and ensure float32 dtype
    flag0 = False

    res_vec = []
    H = MyHamiltonian0(cfg.num_of_qubits, A=A, b=b, P=P, flag0=flag0)
    logging.info("Starting the optimization...")
    gs_energy, __ = runtest(H.ham, cfg, show=False)
    res_vec.append(gs_energy)
    logging.info("Optimization finished.")
    logging.info("TIME= %d [sec]", int(np.round((time.time() - tstart))))
    logging.info("Ground state energy: %s", res_vec)
    return
    # -------------------------------------

if __name__ == '__main__':
    main()

But the GPU memory usage is very low:

[screenshot showing GPU memory usage]
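For scale, low memory usage is expected at this qubit count: the state vector itself is tiny next to the tens of gigabytes of HBM on a GH200 (a back-of-envelope check, not a figure from the thread):

n = 21
state_bytes = 2**n * 16          # 2**n complex128 amplitudes, 16 bytes each
print(state_bytes / 1e6, "MB")   # ~33.6 MB, a sliver of the GPU's memory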

giladqm commented 3 months ago

I think this is a MIG issue, trying to figure it out.

mlxd commented 3 months ago

If you try swapping the device for default.qubit and using PyTorch with CUDA-mapped tensors, does the GPU work? You can likely pick a smaller-scale workload for this (e.g. something from the Torch GPU tests at https://github.com/PennyLaneAI/pennylane/blob/59a1e0586e707d057a0c92d4239036afa5312b73/tests/interfaces/test_torch.py#L399).

If this runs on the GPU without issue, it may be a runtime issue with LGPU. If not, then it is most likely a MIG or CUDA driver issue on the node.
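A minimal version of that check might look like the following (a sketch; assumes a CUDA-enabled PyTorch build):

import pennylane as qml
import torch

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev, interface="torch")
def circuit(x):
    qml.RX(x, wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

# a CUDA-mapped input tensor; if this forward/backward pass succeeds,
# torch + CUDA are working end to end
x = torch.tensor(0.3, requires_grad=True, device="cuda")
res = circuit(x)
res.backward()
print(x.grad)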

giladqm commented 3 months ago

We fixed the MIG issue and now the following code works. I'm trying to find a way to accelerate the program, because I feel like I'm not utilizing the entire GH200. If you know more ways to accelerate it, that would help me a lot.

[screenshot of GPU utilization]

Code:

from mpi4py import MPI
import pennylane as qml
from matplotlib import pyplot as plt
from pennylane import numpy as np
import time

def circuit0_basic(params, wires):
    n_qubits = len(wires)
    n_rotations = len(params)
    if n_rotations > 1:
        n_layers = n_rotations // n_qubits
        n_extra_rots = n_rotations - n_layers * n_qubits

        for layer_idx in range(n_layers):
            layer_params = params[layer_idx * n_qubits: layer_idx * n_qubits + n_qubits, :]
            qml.broadcast(qml.Rot, wires, pattern="single", parameters=layer_params)
            qml.broadcast(qml.CNOT, wires, pattern="ring")

        extra_params = params[-n_extra_rots:, :]
        extra_wires = wires[: n_qubits - 1 - n_extra_rots: -1]
        qml.broadcast(qml.Rot, extra_wires, pattern="single", parameters=extra_params)
    else:
        qml.Rot(*params[0], wires=wires[0])

def circuit0(params, wires):
    n_qubits = len(wires)
    n_rotations = len(params)

    if n_rotations > 1:
        n_layers = n_rotations // n_qubits
        n_extra_rots = n_rotations - n_layers * n_qubits

        for layer_idx in range(n_layers):
            layer_params = params[layer_idx * n_qubits: layer_idx * n_qubits + n_qubits, :]
            qml.broadcast(qml.Rot, wires, pattern="single", parameters=layer_params)
            qml.broadcast(qml.CNOT, wires, pattern="ring")

        extra_params = params[-n_extra_rots:, :]
        extra_wires = wires[: n_qubits - 1 - n_extra_rots: -1]
        qml.broadcast(qml.Rot, extra_wires, pattern="single", parameters=extra_params)
    else:
        qml.Rot(*params[0], wires=wires[0])

def runtest(H, cfg, show=False):
    max_iterations = cfg.max_iterations
    num_qubits = len(H.wires)
    num_param_sets = (2 ** num_qubits) - 1

    params = np.random.uniform(low=-np.pi / 2, high=np.pi / 2, size=(num_param_sets, 3))
    params = np.array(params, requires_grad=True)

    # Enable state access by setting shots=None
    dev = qml.device("lightning.gpu", wires=num_qubits, shots=None, batch_obs=True, mpi = True)

    @qml.qnode(dev, diff_method="adjoint")
    def cost_fn(params):
        circuit0_basic(params, wires=H.wires)
        return qml.expval(H)

    @qml.qnode(dev, diff_method="adjoint")
    def state_fn(params):
        circuit0_basic(params, wires=H.wires)
        return qml.state()

    opt = qml.AdamOptimizer(stepsize=0.1)
    conv_tol = 1e-06
    energy_plot = []
    prev_energy = cost_fn(params)

    for n in range(max_iterations):
        params, energy = opt.step_and_cost(cost_fn, params)
        print("Energy for iteration " + str(n) + " : " + str(energy))
        energy_plot.append(energy)

        if np.abs(energy - prev_energy) <= conv_tol:
            break
        prev_energy = energy

    # Using the params, find the ground state vector
    best_params = params
    ground_state = state_fn(best_params)

    # Plot energies
    plt.clf()
    plt.plot(energy_plot)
    plt.xlabel("Iterations")
    plt.ylabel("Energy")
    plt.title("Energy at each iteration")
    plt.savefig("energy_plot.png")
    if show:
        plt.show()

    return energy, ground_state

class Params0:
    pass

class Hamiltonian:

    def __init__(self, N):
        self.ham = qml.Hamiltonian([], [])
        self.energies = None
        self.states = None
        self.gs_energy = None
        self.gs_state = None

        self.N = N  # Number of qubits

class MyHamiltonian0(Hamiltonian):

    def __init__(self, N, A, b, P=1, flag0=False):
        n = N // P
        super().__init__(N)
        self.flag0 = flag0
        self.P = P
        self.n = n
        self.A = A
        self.b = b
        self.y = self.A @ self.b.reshape(-1,1)
        self.y = np.matrix(self.y)
        self.set_hamiltonian()

    def get_evec(self):
        return self.noise_std * np.random.normal(size=self.n, requires_grad=False)

    def set_hamiltonian(self):

        def func1():
            param1, param2 = self.A.shape
            w1 = np.zeros((param2, param2))
            w2 = np.zeros(param2)
            for m in range(param1):
                for i in range(param2):
                    w2[i] += -2 * self.A[m, i] * self.y[m, 0]
                    for j in range(param2):
                        w1[i, j] += self.A[m, i] * self.A[m, j]
            return w1, w2

        def func2():
            w1, w2 = func1()
            min_val = -2 ** (self.P - 1) + 1
            param1, param2 = w1.shape
            v1 = np.zeros((param2 * self.P, param2 * self.P))
            v2 = np.zeros(param2 * self.P)
            for i in range(param1):
                for s in range(self.P):
                    v2[self.P * i + s] += (2 ** s) * w2[i]
                    for j in range(param2):
                        v2[self.P * i + s] += (2 ** s) * 2 * min_val * w1[i, j]
                        for p in range(self.P):
                            v1[self.P * i + s, self.P * j + p] += (2 ** (s + p)) * w1[i, j]
            return v1, v2

        v1, v2 = func2()

        H = qml.Hamiltonian([], [])
        for i in range(self.n):
            for s in range(self.P):
                xadded = False
                x = i * self.P + s
                fact = - (sum(v1[x, :]) + sum(v1[:, x]) + 2*v2[x])
                if fact != 0:
                    xadded = True
                    H += fact * qml.PauliZ(x)
                for j in range(self.n):
                    for p in range(self.P):
                        y = j*self.P + p
                        fact = v1[x, y]
                        if fact != 0:
                            xadded = True
                            H += fact * qml.PauliZ(x) @ qml.PauliZ(y)
                if not xadded:
                    H += 0.0 * qml.PauliZ(x)

            if self.flag0:
                x = i * self.P
                H += -(1/2)*qml.PauliZ(x) + (1/4)*qml.PauliZ(x) @ qml.PauliZ(x)

        self.ham = H

def main():
    tstart = time.time()

    print('Running experiment0')
    cfg = Params0()
    cfg.ensemble_num = 10
    P = 1
    n = 21
    cfg.num_of_qubits = n * P
    cfg.max_iterations = 10

    cfg.experiment_name = 'experiment0'
    cfg.hamiltonian_type = 'Hamiltonian0'

    A = np.random.randn(n, n)
    b = np.random.randint(2 ** P, size=n, requires_grad=False)
    flag0 = False

    res_vec = []
    H = MyHamiltonian0(cfg.num_of_qubits, A=A, b=b, P=P, flag0=flag0)
    gs_energy, __ = runtest(H.ham, cfg, show=False)
    res_vec.append(gs_energy)
    print("TIME=", int(np.round((time.time() - tstart))), " [sec]")
    print(res_vec)
    return
    # -------------------------------------

if __name__ == '__main__':
    main()
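Since this script imports mpi4py and constructs the device with mpi=True, it has to be launched through an MPI runner; distributed lightning.gpu generally expects a CUDA-aware MPI build and one GPU per rank, so on a single-GPU GH200 a single rank is the practical choice (a sketch, not a launch line verified on this node):

mpirun -np 1 python Gilad_Test.py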

CatalinaAlbornoz commented 3 months ago

Hi @giladqm,

Accelerating programs is an art. You're already doing the best you can by using lightning and adjoint differentiation. You may try other tricks, like changing the default values of some of the keyword arguments in the QNode, but you will probably only get minor improvements, if any. You could also try changing the optimizer to see if this helps. The main issue here is that you're using over 6 million parameters. This is a lot, so it's natural for your program to be slow.

If you're noticing that your GPU usage isn't at 100%, it's probably because your bottleneck is on the CPU side of things. I'm guessing this is also related to the number of parameters you have.

In other cases something like circuit cutting might help, but your circuit has so many CNOTs that it probably won't.

Feel free to explore the PennyLane Discussion Forum to see what others have tried to accelerate their programs too.
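For instance, the optimizer swap mentioned above is a one-line change in runtest (a sketch; whether it actually converges faster on this problem is untested):

opt = qml.GradientDescentOptimizer(stepsize=0.05)  # instead of qml.AdamOptimizer(stepsize=0.1)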

giladqm commented 3 months ago

Thanks @CatalinaAlbornoz. One thing I find really weird is that for the same code with n=7, default.qubit takes 4 seconds but lightning.gpu takes 100-140 seconds. What is the reason for this?

CatalinaAlbornoz commented 3 months ago

Hi @giladqm, lightning.gpu is optimized for circuits of over 20 qubits. There's a big overhead in spinning up all of the processes needed and in passing information from CPU to GPU and vice versa, so for smaller circuits default.qubit or lightning.qubit will work better.

If you go to pennylane.ai/performance you'll notice that at the end of the page we have a table to help you choose between the simulators depending on your circuit. Definitely take a look at this page; you may find some new insights. 😃
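The rule of thumb from that table boils down to something like this (a sketch; the ~20-qubit crossover is a rough guide, not a hard threshold):

import pennylane as qml

n = 7  # circuit width
device_name = "lightning.gpu" if n >= 20 else "lightning.qubit"
dev = qml.device(device_name, wires=n)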

giladqm commented 3 months ago

I appreciate the explanation. Thank you @CatalinaAlbornoz!

giladqm commented 3 months ago

@CatalinaAlbornoz is there somewhere I can read more about the types of devices?