The speedup using GPU over CPU execution seems unexpectedly low

mtarabkhah commented 1 month ago

Issue description

I am benchmarking quantum circuits using Catalyst on a GPU. However, the speedup over CPU execution seems unexpectedly low.

Expected behavior: Much higher speedup (on the order of ~1000x)
Actual behavior: ~5x speedup using GPU over CPU
Reproduces how often: always

System information:


Name: PennyLane
Version: 0.38.0
Summary: PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
Home-page: https://github.com/PennyLaneAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /home/mei/.local/lib/python3.12/site-packages
Requires: appdirs, autograd, autoray, cachetools, networkx, numpy, packaging, pennylane-lightning, requests, rustworkx, scipy, toml, typing-extensions
Required-by: PennyLane-Catalyst, PennyLane_Lightning, PennyLane_Lightning_GPU, PennyLane_Lightning_Kokkos

Platform info: Linux-6.8.0-45-generic-x86_64-with-glibc2.39
Python version: 3.12.4
Numpy version: 1.26.4
Scipy version: 1.12.0
Installed devices:

lightning.gpu (PennyLane_Lightning_GPU-0.38.0)
nvidia.custatevec (PennyLane-Catalyst-0.8.1)
nvidia.cutensornet (PennyLane-Catalyst-0.8.1)
oqc.cloud (PennyLane-Catalyst-0.8.1)
softwareq.qpp (PennyLane-Catalyst-0.8.1)
default.clifford (PennyLane-0.38.0)
default.gaussian (PennyLane-0.38.0)
default.mixed (PennyLane-0.38.0)
default.qubit (PennyLane-0.38.0)
default.qubit.autograd (PennyLane-0.38.0)
default.qubit.jax (PennyLane-0.38.0)
default.qubit.legacy (PennyLane-0.38.0)
default.qubit.tf (PennyLane-0.38.0)
default.qubit.torch (PennyLane-0.38.0)
default.qutrit (PennyLane-0.38.0)
default.qutrit.mixed (PennyLane-0.38.0)
default.tensor (PennyLane-0.38.0)
null.qubit (PennyLane-0.38.0)
lightning.kokkos (PennyLane_Lightning_Kokkos-0.39.0.dev11)
lightning.qubit (PennyLane_Lightning-0.38.0)

Source code and tracebacks

I have provided two code samples, along with more information on the execution times, in the Catalyst-GPU-QS repo.

Additional information

Here are some sample execution times for a 26-qubit GHZ circuit:

GHZ1 (using lightning.qubit on CPU):
Execution time: 2.6811 seconds
GHZ2 (using lightning.gpu on GPU):
Execution time: 0.5751 seconds

This results in a 4.66x speedup with the GPU version, which seems relatively low for GPU acceleration.
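For reference, the speedup figure is simply the ratio of the two execution times reported above. A minimal sketch of that arithmetic (the variable names are illustrative only and are not taken from the benchmark code):

```python
# Execution times reported above for the 26-qubit GHZ circuit.
cpu_seconds = 2.6811  # GHZ1, lightning.qubit on CPU
gpu_seconds = 0.5751  # GHZ2, lightning.gpu on GPU

# Speedup is the CPU time divided by the GPU time.
speedup = cpu_seconds / gpu_seconds
print(f"GPU speedup over CPU: {speedup:.2f}x")
```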

For comparison, running this quantum circuit in Qiskit yielded the following:

GPU execution in Qiskit was ~760x faster than the CPU version.
While Catalyst showed better CPU performance than Qiskit, its GPU performance lagged behind.

P.S. I have not used for loops in the creation of the circuits, as I am using code to automatically generate the circuits based on a provided list of gates. I assume this should not affect the performance.

josh146 commented 1 month ago

Hi @mtarabkhah! Thanks for the benchmarking information.

Catalyst doesn't yet support lightning.gpu, but this is a work in progress and coming shortly. I'm curious how you are benchmarking Catalyst with GPU support?

josh146 commented 1 month ago

> P.S. I have not used for loops in the creation of the circuits, as I am using code to automatically generate the circuits based on a provided list of gates. I assume this should not affect the performance.

Note that Catalyst-compatible for loops (either using qml.for_loop, or qjit(autograph=True)) will in fact lead to an increase in performance, as the circuit will have a compressed representation :)

mtarabkhah commented 1 month ago

Hi @josh146,

Thanks for your reply.

I'm currently using lightning.gpu from PennyLane for the GPU version. Is there another way to use Catalyst for GPU execution?

Could you please review the provided code and suggest ways to improve performance, particularly using Catalyst on GPU?

I also appreciate the comment about "Catalyst-compatible for loops" and will look into that.
