Closed: lmondada closed this issue 1 year ago.
Hi @lmondada! Thanks for posting the data here, that is super helpful.
There is one more variable that would be important to know in the benchmarking, particularly for the 'gradient' column: the differentiation method. Are you using `diff_method="parameter-shift"` for all, or `diff_method="backprop"` for supported combinations?
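For reference, the method can be set explicitly when constructing the QNode; a minimal sketch (the device and gates here are just placeholders):

```python
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

# Select the differentiation method explicitly instead of relying on the default
@qml.qnode(dev, interface="tf", diff_method="backprop")
def circuit(params):
    qml.RX(params[0], wires=0)
    qml.RY(params[1], wires=1)
    return qml.expval(qml.PauliZ(0))
```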
Hi @josh146 !
Thanks for your quick reply. `diff_method` is left at its default, which I believe resolves to `backprop` wherever possible -- I should have set that explicitly...
That being said, the main column I am looking at is the circuit evaluation column. Gradient computation seems to have very similar performance throughout.
Just realised that the execution times of gradients were incorrect, so I removed them. As mentioned above, I am mostly looking at circuit evaluation times anyway.
@lmondada, would you be able to post the QNode you are using in the benchmarking? This is just a guess, but we have previously seen very large slowdowns in the TensorFlow interface when iterating over the elements of a TensorFlow tensor. This could be happening inside a template if you are using one.
Hi,
Apologies for the long silence! The QNode in this example is a single `StronglyEntanglingLayers` template, with 2 layers and 10 wires.
Looking at its source code, it does loop over the parameters to insert the gates of the ansatz.
Wouldn’t any ansatz have to follow a similar structure? Is there a way to avoid this? Thank you @josh146 for your help!
From what I recall of our exploration, the following caused a significant slowdown in TensorFlow:
```python
for w in weights:
    qml.RX(w, wires=0)
```
However, I believe the following change can lead to significant improvement:
```python
for i in range(4):
    qml.RX(weights[i], wires=0)
```
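Putting the two patterns side by side in a self-contained sketch (the single-wire circuit and the weights here are purely illustrative, not the benchmark circuit):

```python
import pennylane as qml
import tensorflow as tf

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="tf")
def slow_circuit(weights):
    # Iterates the TF tensor element by element
    for w in weights:
        qml.RX(w, wires=0)
    return qml.expval(qml.PauliZ(0))

@qml.qnode(dev, interface="tf")
def fast_circuit(weights):
    # Indexes into the tensor with plain Python integers instead
    for i in range(weights.shape[0]):
        qml.RX(weights[i], wires=0)
    return qml.expval(qml.PauliZ(0))

weights = tf.constant([0.1, 0.2, 0.3, 0.4])
print(slow_circuit(weights), fast_circuit(weights))
```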
Hmmm, looking at the `qml.templates.broadcast` code, this seems to be fixed already, so the issue must be somewhere else.
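For context, a minimal sketch of the kind of parameter handling `qml.broadcast` does (the pattern, wires, and parameters here are illustrative):

```python
import pennylane as qml
import tensorflow as tf

dev = qml.device("default.qubit", wires=3)

@qml.qnode(dev, interface="tf")
def rx_layer(pars):
    # Applies RX(pars[i]) to wire i; the template handles the indexing internally
    qml.broadcast(unitary=qml.RX, pattern="single", wires=[0, 1, 2], parameters=pars)
    return qml.expval(qml.PauliZ(0))

print(rx_layer(tf.constant([0.1, 0.2, 0.3])))
```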
Just for reference, the QNode I have been using:

```python
@qml.qnode(dev, interface='tf')
def circuit(params):
    qml.templates.StronglyEntanglingLayers(weights=params, wires=wires)
    return qml.expval(qml.operation.Tensor(*[qml.PauliZ(wires=i) for i in wires]))
```
Oh perfect, thanks @lmondada! Do you also have the parameters and the number of wires you were using to benchmark?
Sure! This is what I used:

```python
import numpy as np

n_wires = 10
n_layers = 2
params_shape = (n_layers, n_wires, 3)
params = np.random.rand(*params_shape)
wires = np.arange(n_wires)
```
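For the `tensorflow` interface runs, the parameters would be wrapped as a TensorFlow tensor, along these lines (a sketch; the exact invocation is in the notebook linked below):

```python
import tensorflow as tf

# Wrap the NumPy parameters so TensorFlow can track them
params_tf = tf.Variable(params)
print(circuit(params_tf))
```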
You can find the entire code that generated those timings here.
Hi @lmondada, we have had a look into profiling and comparing `default.qubit` in the `tensorflow` and `autograd` interfaces. It looks as though there is a `dispatch` wrapper being called on the `tensorflow` side which is contributing to the slowdown in performance here.
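A minimal sketch of how such a comparison can be profiled (assuming the `circuit` and `params_tf` from the snippets above are defined at module level):

```python
import cProfile
import pstats

# Profile repeated circuit evaluations to surface interface-level overhead
cProfile.run("for _ in range(100): circuit(params_tf)", "eval_stats")
pstats.Stats("eval_stats").sort_stats("cumulative").print_stats(20)
```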
Thanks for raising this issue, we will look into this further. And thanks again for sharing your insightful results!
Since so much of PennyLane has changed since the time this was opened, I'm going to go ahead and close this as it is a stale issue.
First of all, thank you very much for all your work! I have been having a lot of fun using Pennylane.
Issue description
Changing from the default `autograd` interface to `tensorflow` comes with a huge slowdown of circuit simulation times when using the `default.qubit` or `default.qubit.tf` plugins. A speed penalty is also observed for other simulators, albeit to a lesser extent.

This table summarises what I mean. It shows the execution times of circuits with 10 qubits (60 trainable, randomly initialised parameters) on different simulators, using either the `autograd` or the `tensorflow` interface. The numbers are similar for larger 20-qubit circuits (I have also tried using `pytorch` at some point and seemed to have similar issues, but never ran benchmarks). This essentially means that I have found no way to train faster than using `default.qubit` on CPU (I would ideally use GPUs, but so far I have not found a way to get any speedup at all).

Expected behavior: Using `default.qubit.tf` on the TensorFlow interface should have performance similar to `default.qubit` on `autograd`. I would also expect optimised simulators such as `qulacs` or `qiskit.aer` to outperform `default.qubit`.

Actual behavior: all simulators are slower than `default.qubit`.
Reproduces how often: always
System information: (ran on Google Colab)
Source code and tracebacks
The timings are obtained from code along the lines of the following minimal sketch (assuming the `circuit` and `params` defined above):
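```python
import timeit

# Average wall-clock time per circuit evaluation
n_reps = 100
total = timeit.timeit(lambda: circuit(params), number=n_reps)
print(f"average evaluation time: {total / n_reps:.4f} s")
```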
A complete notebook with the code used to produce the above table is here.
Is this well-known? If so, what is the bottleneck and what would you suggest I do to scale my training to larger systems? I look forward to hearing from you!
Luca
EDIT: a previous version of this included execution times for gradient computations. These were incorrect and irrelevant, so I removed them.