PennyLaneAI / pennylane

PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
https://pennylane.ai
Apache License 2.0

VQC training on GPU-backend (Qiskit, qulacs) ~3x slower than default.qubit #919


iamlucaswolf commented 3 years ago

Issue description

I am trying to run some experiments with Variational Quantum Circuits.

Curiously, training seems to be consistently slower on the GPU, regardless of which backend is used. In particular, I compared the training and inference times of three devices:

- 'default.qubit'
- 'qiskit.aer' with backend='statevector_gpu' (with qiskit-aer-gpu installed)
- 'qulacs.simulator' with gpu=True

Throughout my experiments, 'default.qubit' was consistently around 3x faster than the GPU-backed versions. I measured running times with IPython's %%timeit magic, trained with PyTorch, and tried circuits and batches of varying sizes, all with similar results.
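For reference, here is a minimal sketch of the kind of setup I am timing (the circuit, sizes, and cost function below are illustrative, not my exact code):

```python
import pennylane as qml
import torch

n_qubits = 4
n_layers = 2

# Swap in "qiskit.aer" (backend="statevector_gpu") or "qulacs.simulator" (gpu=True) here
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(inputs, weights):
    qml.templates.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.templates.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

inputs = torch.rand(n_qubits)
weights = torch.rand(n_layers, n_qubits, 3, requires_grad=True)

# Timed in IPython with %%timeit:
loss = circuit(inputs, weights)
loss.backward()
```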

Is this expected behaviour? In general, is the GPU on GPU-aware backends ever used during backpropagation? Thanks!

```
Name: PennyLane
Version: 0.12.0
Summary: PennyLane is a Python quantum machine learning library by Xanadu Inc.
Home-page: https://github.com/XanaduAI/pennylane
Author: None
Author-email: None
License: Apache License 2.0
Location: /home/iamlucaswolf/.cache/pypoetry/virtualenvs/quantum-rl-UFQBZLzo-py3.8/lib/python3.8/site-packages
Requires: semantic-version, appdirs, scipy, numpy, networkx, toml, autograd
Required-by: pennylane-qulacs, PennyLane-qiskit
Platform info: Linux-5.4.0-53-generic-x86_64-with-glibc2.27
Python version: 3.8.6
Numpy version: 1.19.4
Scipy version: 1.5.4
Installed devices:
```

josh146 commented 3 years ago

Hi @iamlucaswolf! That is a bit strange. While I'm not 100% sure of the reason for the slowdown, it might be something to do with the quantum differentiation method.

When you use a device that supports a GPU backend (such as qiskit.aer or qulacs.simulator), the device should be using the GPU for its internal quantum evaluation. However, the quantum part of the model remains a black box to the classical optimizer (in this case, PyTorch). When PyTorch encounters a QNode during backpropagation, it passes any intermediate values to PennyLane, and PennyLane takes over computing the gradient.

Importantly, as part of this hand-off, any GPU data is copied to the CPU, and PennyLane then executes the quantum device (potentially on a GPU, if using a supported backend). So with every optimization step, data is being copied GPU-CPU-GPU twice, which could be the cause of the overhead.
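As an illustration, this is roughly what happens with a GPU simulator backend and the Torch interface (the circuit below is just a toy example; the key point is that the gradient is obtained by extra device executions rather than by Torch's own backprop):

```python
import pennylane as qml
import torch

# Assumes pennylane-qiskit with qiskit-aer-gpu installed.
dev = qml.device("qiskit.aer", wires=2, backend="statevector_gpu")

@qml.qnode(dev, interface="torch", diff_method="parameter-shift")
def circuit(theta):
    qml.RX(theta, wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

theta = torch.tensor(0.3, requires_grad=True)

loss = circuit(theta)  # parameters are handed over to the device (GPU) for execution
loss.backward()        # PennyLane evaluates extra circuits to obtain the gradient,
                       # moving data between the Torch side (CPU) and the device
```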

To avoid this, you can instead use a device that supports classical backpropagation (diff_method="backprop") on GPUs, such as default.qubit.tf. If you use TensorFlow as your ML interface, the entire computation is then handled by TensorFlow: there is no hand-off and no GPU-CPU copying of data. This works for TPUs as well as GPUs.
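For example, a minimal sketch of the backprop route (the wire count and circuit are illustrative):

```python
import pennylane as qml
import tensorflow as tf

n_qubits = 4
dev = qml.device("default.qubit.tf", wires=n_qubits)

@qml.qnode(dev, interface="tf", diff_method="backprop")
def circuit(weights):
    qml.templates.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

weights = tf.Variable(tf.random.uniform((2, n_qubits, 3)))

# If a GPU is visible to TensorFlow, both the simulation and the gradient run there.
with tf.GradientTape() as tape:
    loss = circuit(weights)

grads = tape.gradient(loss, weights)  # no extra circuit executions, no CPU hand-off
```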

iamlucaswolf commented 3 years ago

Hi @josh146, thank you so much for your quick reply! I had a look at default.qubit.tf and implemented a PyTorch analogue. Would you guys be interested in a PR?

josh146 commented 3 years ago

Hi @iamlucaswolf, definitely! That would be greatly appreciated; contributions are more than welcome.

Feel free to open a work-in-progress PR (simply begin the title with [WIP]), and you can also tag us in code comments if you have any questions 🙂

cnktysz commented 3 years ago

Hi @iamlucaswolf, how many qubits do you have? For qulacs, GPU acceleration generally gives you a boost only above roughly 20 qubits. If you have fewer, I don't think you will see any speed-up; I just thought this might be the reason. Please check the benchmark they provide: https://github.com/qulacs/benchmark-qulacs.
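If you want to check this quickly, here is a rough sketch (illustrative only, not taken from the benchmark above) that times a single circuit evaluation as the qubit count grows:

```python
import time
import numpy as np
import pennylane as qml

# Assumes pennylane-qulacs with GPU support installed.
for n_qubits in (10, 15, 20, 24):
    dev = qml.device("qulacs.simulator", wires=n_qubits, gpu=True)

    @qml.qnode(dev)
    def circuit(weights):
        qml.templates.StronglyEntanglingLayers(weights, wires=range(n_qubits))
        return qml.expval(qml.PauliZ(0))

    weights = np.random.uniform(size=(2, n_qubits, 3))
    start = time.perf_counter()
    circuit(weights)
    print(f"{n_qubits} qubits: {time.perf_counter() - start:.3f} s")
```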