PennyLaneAI / pennylane

PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
https://pennylane.ai
Apache License 2.0

Circuit execution suffers a 4x slowdown for tensorflow interface #1176

Closed lmondada closed 1 year ago

lmondada commented 3 years ago

First of all, thank you very much for all your work! I have been having a lot of fun using Pennylane.

Issue description

Changing from the default autograd interface to tensorflow comes with a huge slowdown of circuit simulation times when using the default.qubit or default.qubit.tf plugins. A speed penalty is also observed for other simulators, albeit to a lesser extent.

This table summarises what I mean. It shows the execution times of circuits with 10 qubits (60 trainable, randomly initialised parameters) on different simulators, using either the autograd or the tensorflow interface.

Circuit execution times

The numbers are similar for larger 20-qubit circuits (I also tried pytorch at some point and seemed to hit similar issues, but never ran benchmarks). This essentially means that I have found no way to train faster than using default.qubit on CPU (I would ideally use GPUs, but so far I have not found a way to get any speedup at all).

Source code and tracebacks

The timings are obtained from code along the lines of

qnode = qml.QNode(circuit_fn, some_dev, interface=some_interface)
# time circuit simulation
timeit.timeit("qnode(params)", globals=globals())

A complete notebook with the code used to produce the above table is here.
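For reference, a self-contained version of that timing pattern using only the standard library and NumPy (the circuit is stood in for by a placeholder matrix-vector multiply here, since the actual QNode is defined in the notebook; `evaluate_circuit` is a hypothetical stand-in, not PennyLane code):

```python
import timeit

import numpy as np

# Stand-in workload for one circuit evaluation on a 10-qubit state vector.
# In the real benchmark, this would be the qnode(params) call.
rng = np.random.default_rng(0)
dim = 2 ** 10
state = rng.standard_normal(dim)
unitary = rng.standard_normal((dim, dim))

def evaluate_circuit():
    return unitary @ state

# Average wall-clock time per call, the quantity reported in the table.
n_calls = 10
total = timeit.timeit(evaluate_circuit, number=n_calls)
per_call = total / n_calls
print(f"{per_call * 1e3:.3f} ms per evaluation")
```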

Is this well-known? If so, what is the bottleneck and what would you suggest I do to scale my training to larger systems? I look forward to hearing from you!

Luca

EDIT: a previous version of this included execution times for gradient computations. These were incorrect and irrelevant, so I removed them.

josh146 commented 3 years ago

Hi @lmondada! Thanks for posting the data here, that is super helpful.

There is one more variable that would be important to know in the benchmarking, particularly for the 'gradient' column, which is the differentiation method. Are you using diff_method="parameter-shift" for all, or diff_method="backprop" for supported combinations?

lmondada commented 3 years ago

Hi @josh146 !

Thanks for your quick reply. diff_method is left to default, which I believe defaults to backprop wherever possible -- I should have set that explicitly...

That being said, the main column I am looking at is the circuit evaluation column. Gradient computation seems to have very similar performance throughout.

lmondada commented 3 years ago

> Hi @lmondada! Thanks for posting the data here, that is super helpful.
>
> There is one more variable that would be important to know in the benchmarking, particularly for the 'gradient' column, which is the differentiation method. Are you using diff_method="parameter-shift" for all, or diff_method="backprop" for supported combinations?

Just realised that the execution times of gradients were incorrect, so I removed them. As mentioned above, I am mostly looking at circuit evaluation times anyway.

josh146 commented 3 years ago

@lmondada, would you be able to post the QNode you are using in the benchmarking? This is just a guess, but we have previously seen very large slowdowns in the TensorFlow interface if iterating over the elements in a TensorFlow tensor. This could be happening inside a template if you are using one.

lmondada commented 3 years ago

Hi, apologies for the long silence! The QNode in this example is a single StronglyEntanglingLayers template, with 2 layers and 10 wires. Looking at its source code, it does loop over the parameters to insert the gates of the ansatz.

Wouldn’t any ansatz have to follow a similar structure? Is there a way to avoid this? Thank you @josh146 for your help!

josh146 commented 3 years ago

From what I recall of our exploration, the following caused a significant slowdown in TensorFlow:

for w in weights:
    qml.RX(w, wires=0)

However, I believe the following change can lead to significant improvement:

for i in range(4):
    qml.RX(weights[i], wires=0)
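A framework-agnostic illustration of the two access patterns, using a NumPy array as a stand-in for a TensorFlow tensor (this only demonstrates that both patterns yield identical values; the performance difference reported above is specific to how TensorFlow handles iteration over its tensors):

```python
import numpy as np

weights = np.linspace(0.0, 1.0, 4)

# Pattern 1: iterate over the tensor directly (the reportedly slow path).
angles_iter = []
for w in weights:
    angles_iter.append(float(w))

# Pattern 2: index by integer position (the suggested workaround).
angles_idx = []
for i in range(len(weights)):
    angles_idx.append(float(weights[i]))

# Both patterns extract exactly the same parameter values.
assert angles_iter == angles_idx
```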
lmondada commented 3 years ago

Hmmm, looking at the qml.templates.broadcast code, this seems to be fixed already, so the issue must be somewhere else.

Just for reference the QNode I have been using:

@qml.qnode(dev, interface='tf')
def circuit(params):
    qml.templates.StronglyEntanglingLayers(weights=params, wires=wires)
    return qml.expval(qml.operation.Tensor(*[qml.PauliZ(wires=i) for i in wires]))
josh146 commented 3 years ago

Oh perfect, thanks @lmondada! Do you also have the parameters and the number of wires you were using to benchmark?

lmondada commented 3 years ago

Sure! This is what I used

import numpy as np

n_wires = 10
n_layers = 2
params_shape = (n_layers, n_wires, 3)
params = np.random.rand(*params_shape)
wires = np.arange(n_wires)

You can find the entire code that generated those timings here
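As a quick sanity check, that shape is consistent with the 60 trainable parameters quoted at the top of the issue:

```python
import numpy as np

# Same shape as above: (n_layers, n_wires, 3) = (2, 10, 3).
params = np.random.rand(2, 10, 3)
print(params.size)  # → 60 trainable parameters
```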

anthayes92 commented 3 years ago

Hi @lmondada, we have had a look into profiling and comparing default.qubit under the tensorflow and autograd interfaces. It looks as though there is a dispatch wrapper being called on the tensorflow side which is contributing to the slower performance here.

Thanks for raising this issue, we will look into this further. And thanks again for sharing your insightful results!
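For anyone wanting to reproduce this kind of analysis, per-call overhead like a dispatch wrapper can be localized with the standard-library profiler. A generic sketch (the `hot_path` workload is a hypothetical stand-in; in practice you would profile the `qnode(params)` call itself):

```python
import cProfile
import io
import pstats

def hot_path():
    # Stand-in workload; replace with the qnode(params) call when profiling.
    total = 0.0
    for i in range(10_000):
        total += i * 0.5
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_path()
profiler.disable()

# Sorting by cumulative time surfaces wrapper/dispatch overhead near the top.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```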

albi3ro commented 1 year ago

Since so much of PennyLane has changed since the time this was opened, I'm going to go ahead and close this as it is a stale issue.