Open mtarabkhah opened 1 month ago
Hi @mtarabkhah! Thanks for the benchmarking information.
Catalyst doesn't yet support lightning.gpu, but this is a work in progress and coming shortly. I'm curious how you are benchmarking Catalyst with GPU support?
> P.S. I have not used for loops in the creation of the circuits, as I am using code to automatically generate the circuits based on a provided list of gates. I assume this should not affect the performance.
Note that Catalyst-compatible for loops (either using qml.for_loop, or qjit(autograph=True)) will in fact lead to an increase in performance, as the circuit will have a compressed representation :)
Hi @josh146,
Thanks for your reply.
I'm currently using lightning.gpu from PennyLane for the GPU version. Is there another way to use Catalyst for GPU execution?
Could you please review the provided code and suggest ways to improve performance, particularly using Catalyst on GPU?
I also appreciate the comment about "Catalyst-compatible for loops" and will look into that.
Issue description
I am benchmarking quantum circuits using Catalyst on a GPU. However, the speedup over CPU execution seems unexpectedly low.
Expected behavior: A much higher speedup (on the order of ~1000x)
Actual behavior: ~5x speedup using GPU over CPU
Reproduces how often: always
System information:
Platform info: Linux-6.8.0-45-generic-x86_64-with-glibc2.39
Python version: 3.12.4
Numpy version: 1.26.4
Scipy version: 1.12.0
Installed devices:
Source code and tracebacks
I have provided two code samples, with more information on the execution times, in the Catalyst-GPU-QS repo.
Additional information
Here are some sample execution times for a 26-qubit GHZ circuit:
lightning.qubit (on CPU): Execution time: 2.6811 seconds
lightning.gpu (on GPU): Execution time: 0.5751 seconds
This results in a 4.66x speedup with the GPU version, which seems relatively low for GPU acceleration.
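The speedup figure above can be reproduced with a short calculation. The sketch below also includes a small, hypothetical timing helper (time_call is not part of PennyLane or Catalyst) that reports the first call separately from later calls, on the assumption that a first call may include one-off costs such as JIT compilation or device initialisation. The dummy workload is a placeholder, not the circuit from the issue.

```python
import time


def time_call(fn, *args, repeats=3):
    """Time fn(*args): return (first_call_seconds, best_later_seconds)."""
    start = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        later.append(time.perf_counter() - start)
    return first, min(later)


# Execution times reported in the issue for the 26-qubit GHZ circuit.
cpu_seconds = 2.6811
gpu_seconds = 0.5751
speedup = cpu_seconds / gpu_seconds
print(f"Speedup: {speedup:.2f}x")  # Speedup: 4.66x

# Placeholder workload standing in for a circuit execution.
first, steady = time_call(lambda: sum(range(100_000)))
print(f"first call: {first:.6f} s, best later call: {steady:.6f} s")
```

If the first call is much slower than the later ones, comparing only the later calls between CPU and GPU may give a clearer picture of the execution-time speedup.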
For comparison, running this quantum circuit in Qiskit yielded the following:
P.S. I have not used for loops in the creation of the circuits, as I am using code to automatically generate the circuits based on a provided list of gates. I assume this should not affect the performance.