Qiskit / qiskit-aer

Aer is a high performance simulator for quantum circuits that includes noise models
https://qiskit.github.io/qiskit-aer/
Apache License 2.0

Linear increase in simulation time with Aer simulators after every epoch #1229

Closed · maniraman-periyasamy closed this 3 years ago

maniraman-periyasamy commented 3 years ago

Information

What is the current behavior?

Simulation time increases linearly after every epoch when training the "Hybrid quantum-classical Neural Networks with PyTorch and Qiskit" example from the Qiskit Textbook using any of the Aer simulators. The simulation time per epoch stays almost the same if the BasicAer simulator is used.

A plot of simulation time vs. epoch number for the various simulators is given below.

[Figure: Hybrid_VQC_timeVsepoch]

This behavior is not observed with Qiskit-Aer 0.7.6 and Qiskit 0.24.1, where the Aer simulator is also much faster than BasicAer.

[Figure: Hybrid_VQC_timeVsepoch_Qiskit24]

Steps to reproduce the problem

Run the "Hybrid quantum-classical Neural Networks with PyTorch and Qiskit" example from the Qiskit Textbook (it was written for Qiskit 0.23.1, so a few changes are needed to run it with Qiskit 0.25.1 and Qiskit-Aer 0.8.1), or the code snippet below.

import numpy as np
import matplotlib.pyplot as plt
import time

import torch
from torch.autograd import Function
from torchvision import datasets, transforms
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

import qiskit
from qiskit import transpile, assemble
from qiskit.visualization import *
from qiskit.providers.aer import AerSimulator, QasmSimulator, StatevectorSimulator
from qiskit import IBMQ, BasicAer, Aer

class QuantumCircuit:
    """ 
    This class provides a simple interface for interaction 
    with the quantum circuit 
    """

    def __init__(self, n_qubits, backend, shots):
        # --- Circuit definition ---
        self._circuit = qiskit.QuantumCircuit(n_qubits)

        all_qubits = [i for i in range(n_qubits)]
        self.theta = qiskit.circuit.Parameter('θ')

        self._circuit.h(all_qubits)
        self._circuit.barrier()
        self._circuit.rz(self.theta, all_qubits)

        self._circuit.draw(output = 'mpl')
        self._circuit.measure_all()
        # ---------------------------

        self.backend = backend
        self.shots = shots

    def run(self, thetas):

        t_qc = transpile(self._circuit,
                         self.backend)
        qobj = assemble([t_qc.bind_parameters({self.theta: theta}) for theta in thetas],
                        shots=self.shots, backend=self.backend)
        job = self.backend.run(qobj)
        result = job.result().get_counts(0)

        """ param_check = [{self.theta: theta} for theta in thetas]
        result_job = qiskit.execute(self._circuit, backend = self.backend, shots=self.shots,
                        parameter_binds = [{self.theta: theta} for theta in thetas]).result()
        result = result_job.get_counts(0) """

        counts = np.array(list(result.values()))
        states = np.array(list(result.keys())).astype(float)

        # Compute probabilities for each state
        probabilities = counts / self.shots
        # Get state expectation
        expectation = np.sum(states * probabilities)

        return np.array([expectation])

class HybridFunction(Function):
    """ Hybrid quantum - classical function definition """

    @staticmethod
    def forward(ctx, input, quantum_circuit, shift):
        """ Forward pass computation """
        ctx.shift = shift
        ctx.quantum_circuit = quantum_circuit

        expectation_z = ctx.quantum_circuit.run(input[0].tolist())
        result = torch.tensor([expectation_z])
        ctx.save_for_backward(input, result)

        return result

    @staticmethod
    def backward(ctx, grad_output):
        """ Backward pass computation """
        input, expectation_z = ctx.saved_tensors
        input_list = np.array(input.tolist())

        shift_right = input_list + np.ones(input_list.shape) * ctx.shift
        shift_left = input_list - np.ones(input_list.shape) * ctx.shift

        gradients = []
        for i in range(len(input_list)):
            expectation_right = ctx.quantum_circuit.run(shift_right[i])
            expectation_left  = ctx.quantum_circuit.run(shift_left[i])

            gradient = torch.tensor([expectation_right]) - torch.tensor([expectation_left])
            gradients.append(gradient)
        gradients = np.array([gradients]).T
        return torch.tensor([gradients]).float() * grad_output.float(), None, None

class Hybrid(nn.Module):
    """ Hybrid quantum - classical layer definition """

    def __init__(self, backend, shots, shift):
        super(Hybrid, self).__init__()
        self.quantum_circuit = QuantumCircuit(3, backend, shots)
        self.shift = shift

    def forward(self, input):
        return HybridFunction.apply(input, self.quantum_circuit, self.shift)

class Net(nn.Module):
    def __init__(self, backend):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.dropout = nn.Dropout2d()
        self.fc1 = nn.Linear(256, 64)
        self.fc2 = nn.Linear(64, 1)
        self.hybrid = Hybrid(backend, 100, np.pi / 2)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout(x)
        x = x.view(1, -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        x = self.hybrid(x)
        return torch.cat((x, 1 - x), -1)

def train(backend,epochs, train_loader):

    model = Net(backend)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_func = nn.NLLLoss()

    loss_list = []
    epoch_time = []
    per_batch_time = 0.0
    backward_time = 0.0
    model.train()
    for epoch in range(epochs):
        per_batch_time = 0.0
        backward_time = 0.0
        batch_count = 0
        start_epoch = time.time()
        total_loss = []
        for batch_idx, (data, target) in enumerate(train_loader):

            start_batch = time.time()
            optimizer.zero_grad()
            # Forward pass
            output = model(data)
            # Calculating loss
            loss = loss_func(output, target)
            # Backward pass
            start_back = time.time()
            loss.backward()
            backward_time = backward_time + time.time()-start_back
            # Optimize the weights
            optimizer.step()

            total_loss.append(loss.item())
            per_batch_time = per_batch_time + time.time()-start_batch

            batch_count = batch_count + 1
        epoch_time.append(time.time()-start_epoch)
        loss_list.append(sum(total_loss)/len(total_loss))
        print('Training [{:.0f}%]\tLoss: {:.4f}'.format(
            100. * (epoch + 1) / epochs, loss_list[-1]))
        print(f'Epoch Time: {epoch_time[epoch]}, Batch Count: {batch_count},  Batch avg. Time: {per_batch_time/batch_count}, Backwardpass avg. Time: {backward_time/batch_count}')
    return loss_list, epoch_time

n_samples = 100

X_train = datasets.MNIST(root='./data', train=True, download=True,
                         transform=transforms.Compose([transforms.ToTensor()]))

# Leaving only labels 0 and 1 
idx = np.append(np.where(X_train.targets == 0)[0][:n_samples], 
                np.where(X_train.targets == 1)[0][:n_samples])

X_train.data = X_train.data[idx]
X_train.targets = X_train.targets[idx]

train_loader = torch.utils.data.DataLoader(X_train, batch_size=1, shuffle=True)

dev = 'CPU'

#provider = IBMQ.load_account()
#backend = provider.get_backend('ibmq_athens')
#aersim_backend = AerSimulator.from_backend(backend)

stateVec_aersim_backend = StatevectorSimulator()
stateVec_aer_loss_list, stateVec_aer_epoch_time = train(backend=stateVec_aersim_backend,epochs=10, train_loader=train_loader)

Qasm_aersim_backend = QasmSimulator(method='automatic')
Qasm_aer_loss_list, Qasm_aer_epoch_time = train(backend=Qasm_aersim_backend,epochs=10, train_loader=train_loader)

aersim_backend = AerSimulator(method='automatic',device = dev)
aer_loss_list, aer_epoch_time = train(backend=aersim_backend,epochs=10, train_loader=train_loader)

basicaer_backend = BasicAer.get_backend('qasm_simulator')
basicaer_loss_list, basicaer_epoch_time = train(backend=basicaer_backend,epochs=10, train_loader=train_loader)

epochs_list = np.arange(1,len(basicaer_epoch_time)+1)
fig = plt.figure()
ax = plt.axes()
ax.plot(epochs_list, aer_epoch_time, color='red', label='AerSimulator')
ax.plot(epochs_list, stateVec_aer_epoch_time, color='pink', label='StateVecSimulator')
ax.plot(epochs_list, Qasm_aer_epoch_time, color='green', label='QasmSimulator')
ax.plot(epochs_list, basicaer_epoch_time, color='blue', label='BasicAer')
plt.legend()
plt.title("Hybrid Quantum-Classical VQC")
plt.xlabel("Epochs")
plt.ylabel("Time (s)")
plt.savefig("Hybrid_VQC_timeVsepoch")

What is the expected behavior?

The Aer simulators should have a roughly constant simulation time per epoch, and the Aer simulator should be faster than BasicAer.

chriseclectic commented 3 years ago

@maniraman-periyasamy Can you check if this issue is present on master? It might have been related to the memory issue fixed by #1200

maniraman-periyasamy commented 3 years ago

@chriseclectic The pattern persists even on master (qiskit-aer 0.9.0, qiskit-terra 0.17.1). BasicAer gives a constant and faster simulation time, whereas AerSimulator's simulation time increases linearly every epoch.

[Figure: Hybrid_VQC_timeVsepoch_aer-master]

chriseclectic commented 3 years ago

Thanks for checking @maniraman-periyasamy. @hhorii, can you look into what might be causing this?

hhorii commented 3 years ago

I guess that this is not an issue in Aer. Could you try

        t_qc = self._circuit

instead of

        t_qc = transpile(self._circuit,
                         self.backend)

?

maniraman-periyasamy commented 3 years ago

@hhorii I ran the simulator as you suggested. There are two things to note.

  1. The issue with Aer still persists, though using the circuit directly instead of transpiling reduced the per-epoch increase in simulation time from roughly 12 s to roughly 3 s, and the Aer simulation is now faster than BasicAer in the first epoch. [Figure: Hybrid_VQC_timeVsepoch_aer-master_circuit]

  2. BasicAer couldn't be run without transpile. When I tried to, the following error was thrown: "qiskit.providers.basicaer.exceptions.BasicAerError: 'qasm_simulator encountered unrecognized operation "h"'". So I had to run BasicAer with transpile (not sure if there is any other way!).

chriseclectic commented 3 years ago

I think part of this is just the overhead of the assembly and C++ wrapping code of the simulator taking an order of magnitude more time than the actual simulation for such a small circuit (3 qubits).

Unfortunately there isn't too much we can do about that in general at the moment, but for your specific example you could rewrite it so that the whole batch of circuits and theta params is executed in a single backend.run call rather than one circuit/theta per call (a sketch of this applied to your snippet is given after the benchmark numbers below).

For example breaking your circuit function up into a simple benchmark:

def make_circuit(n_qubits, backend):
    all_qubits = [i for i in range(n_qubits)]
    theta = qiskit.circuit.Parameter('θ')
    circuit = qiskit.QuantumCircuit(n_qubits) 
    circuit.h(all_qubits)
    circuit.barrier()
    circuit.rz(theta, all_qubits)
    circuit.measure_all()
    return transpile(circuit, backend), theta

def run_circuit(circuit, param_theta, thetas, backend, shots):
    theta_circs = [circuit.bind_parameters({param_theta: theta}) for theta in thetas]
    job = backend.run(theta_circs, shots=shots)
    result = job.result().get_counts(0)

    counts = np.array(list(result.values()))
    states = np.array(list(result.keys())).astype(float)

    # Compute probabilities for each state
    probabilities = counts / shots
    # Get state expectation
    expectation = np.sum(states * probabilities)

    return np.array([expectation])

# Example params
nq = 3
shots = 100
batch_size = 200
thetas = np.random.rand(batch_size)

# Time BasicAer
circuit, param_theta = make_circuit(nq, basicaer_backend)

# Time single theta
%timeit -r 1 run_circuit(circuit, param_theta, thetas[:1], basicaer_backend, shots)

# Time full batch w/ full exec
%timeit -r 1 run_circuit(circuit, param_theta, thetas, basicaer_backend, shots)

# Time full batch w/ 1 circ exec
%timeit -r 1 [run_circuit(circuit, param_theta, [theta], basicaer_backend, shots) for theta in thetas]

# Time Aer
circuit, param_theta = make_circuit(nq, aersim_backend)

# Time single theta
%timeit -r 1 run_circuit(circuit, param_theta, thetas[:1], aersim_backend, shots)

# Time full batch w/ full exec
%timeit -r 1 run_circuit(circuit, param_theta, thetas, aersim_backend, shots)

# Time full batch w/ 1 circ exec
%timeit -r 1 [run_circuit(circuit, param_theta, [theta], aersim_backend, shots) for theta in thetas]

In BasicAer, since it's all Python, there is not much difference between executing the batch one circuit at a time or all together:

2.03 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
455 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
413 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

With Aer, which has to go through pybind for the qobj and results, there is a large overhead per execution call (~75 ms here), so there is a huge difference between executing the full batch in a single call and iterating over each item in the batch:

75.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
506 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
15.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

That said, this still doesn't explain why the time is basically increasing each epoch as if it's accumulating, rather than being constant (and just larger from the overhead).
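Applied to the reproduction snippet, a minimal sketch of that batching could look like the following. The subclass and method name run_batch are hypothetical; it assumes a backend that accepts a list of circuits in backend.run, as in the benchmark above (for BasicAer the circuit would still need to be transpiled once for its basis gates).

import numpy as np

class QuantumCircuitBatched(QuantumCircuit):
    """ Sketch only: same circuit as the snippet above, but one backend.run call
    per batch of theta values instead of one call per theta """

    def run_batch(self, thetas):
        # Bind every theta up front and submit all bound circuits in a single run
        # call, so the per-call assembly/pybind overhead is paid once per batch.
        bound = [self._circuit.bind_parameters({self.theta: theta}) for theta in thetas]
        job = self.backend.run(bound, shots=self.shots)
        result = job.result()

        expectations = []
        for i in range(len(bound)):
            counts = result.get_counts(i)
            states = np.array(list(counts.keys())).astype(float)
            probabilities = np.array(list(counts.values())) / self.shots
            # Expectation of the measured bitstring value for circuit i
            expectations.append(np.sum(states * probabilities))
        return np.array(expectations)

Inside HybridFunction.backward this would mean collecting all shifted parameters and making one call, roughly:

# Hypothetical change inside HybridFunction.backward: one run for all shifts
shifts = np.concatenate([shift_right.flatten(), shift_left.flatten()])
expectations = ctx.quantum_circuit.run_batch(shifts)
exp_right, exp_left = np.split(expectations, 2)
gradients = exp_right - exp_left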

chriseclectic commented 3 years ago

After further profiling I think I found the culprit (fix in #1232). There was an internal bug in the backend class with how basis gates are computed that was resulting in a side effect causing larger and larger set comparisons on each execution. This seems to be the main reason for the linear increase in time.
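Purely as a toy illustration of that kind of side effect (this is not Aer's actual code), the pattern is shared state that grows on every execution, so each call compares against a larger collection than the last:

class ToyBackend:
    """ Toy example only: shows the accumulating-state bug pattern, not Aer's implementation """

    def __init__(self, basis_gates):
        self._basis_gates = list(basis_gates)

    def run(self, circuit_gates):
        # Bug pattern: extending shared state instead of building a fresh local list,
        # so the membership checks below scan an ever-longer list on each call.
        self._basis_gates.extend(circuit_gates)
        return all(g in self._basis_gates for g in circuit_gates)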

After fixing this I get the following when running your script (with a tweak to avoid transpiling on every run call by putting self._transpiled_circuit = transpile(self._circuit, backend) in the __init__).

This seems to fix the increase bug, and then the constant offset is now just the C++ pybinding overhead.

[Figure: Hybrid_VQC_timeVsepoch — with the fix applied]
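For reference, a minimal sketch of that tweak applied to the QuantumCircuit class from the reproduction snippet (here the bound circuits are passed to backend.run directly, as in the benchmark above, instead of going through assemble):

import numpy as np
import qiskit
from qiskit import transpile

class QuantumCircuit:
    """ Same interface as the original snippet, but transpiles only once in __init__ """

    def __init__(self, n_qubits, backend, shots):
        self._circuit = qiskit.QuantumCircuit(n_qubits)
        self.theta = qiskit.circuit.Parameter('θ')
        all_qubits = list(range(n_qubits))
        self._circuit.h(all_qubits)
        self._circuit.barrier()
        self._circuit.rz(self.theta, all_qubits)
        self._circuit.measure_all()

        self.backend = backend
        self.shots = shots
        # Transpile once here so run() only binds parameters and executes.
        self._transpiled_circuit = transpile(self._circuit, backend)

    def run(self, thetas):
        bound = [self._transpiled_circuit.bind_parameters({self.theta: theta})
                 for theta in thetas]
        job = self.backend.run(bound, shots=self.shots)
        result = job.result().get_counts(0)

        counts = np.array(list(result.values()))
        states = np.array(list(result.keys())).astype(float)
        # Compute probabilities for each state and the state expectation
        probabilities = counts / self.shots
        expectation = np.sum(states * probabilities)
        return np.array([expectation])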

Re-running my above benchmark with this fix, the overhead for Aer is greatly reduced as well:

2.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
435 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
626 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)