Qiskit / qiskit-aer

Aer is a high performance simulator for quantum circuits that includes noise models
https://qiskit.github.io/qiskit-aer/
Apache License 2.0

[Follow up] Issue-1721: GPU low clock usage #1919

Open rbnfrhnr opened 1 year ago

rbnfrhnr commented 1 year ago

Information

This is a follow-up to Issue-1721: GPU low clock usage. I wanted to ask whether there has been any progress on enabling batching over multiple circuits on the GPU, as mentioned by @doichanj.

What is the current behavior?

The Aer sampler on GPU does not appear to be optimized for executing many circuits with parameters: GPU activity accounts for only a relatively small fraction of the wall time of the sampler.run() call.

Steps to reproduce the problem

The following code reproduces the behavior.

from time import time

import numpy as np
from qiskit import QuantumRegister, ClassicalRegister, QuantumCircuit
from qiskit.circuit.library import RealAmplitudes
from qiskit.utils import algorithm_globals
from qiskit_aer.primitives import Sampler as AerSampler

# quantum autoencoder ansatz
def auto_encoder_circuit(num_latent: int, num_trash: int, depth: int = 5) -> QuantumCircuit:
    qr = QuantumRegister(num_latent + 2 * num_trash + 1, "q")
    cr = ClassicalRegister(1, "c")
    circuit = QuantumCircuit(qr, cr)
    encoder = RealAmplitudes(num_latent + num_trash, reps=depth)

    circuit.compose(encoder, range(0, num_latent + num_trash), inplace=True)
    circuit.barrier()
    auxiliary_qubit = num_latent + 2 * num_trash

    circuit.h(auxiliary_qubit)
    for i in range(num_trash):
        circuit.cswap(auxiliary_qubit, num_latent + i, num_latent + num_trash + i)

    circuit.h(auxiliary_qubit)
    circuit.measure(auxiliary_qubit, cr[0])
    return circuit

n = 1500
ansatz_depth = 5
latent_space_qubits = 6
trash_space_qubits = 1

ae = auto_encoder_circuit(latent_space_qubits, trash_space_qubits, depth=ansatz_depth)
# circuit for data encoding (amplitude encoding)
qc_ae_training = QuantumCircuit(latent_space_qubits + 2 * trash_space_qubits + 1, 1)
qc_ae_training = qc_ae_training.compose(ae)

# training data of size 128, L2-normalized for amplitude encoding
train_data = np.random.random(size=(n, 128))
train_data = train_data / np.linalg.norm(train_data, axis=1).reshape(-1, 1)

# initial param value for encoder
param_values = algorithm_globals.random.random(len(qc_ae_training.parameters))

def build_circ(x: np.ndarray) -> QuantumCircuit:
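    # include the classical bit so that compose() below can map the measurement onto it
    circ = QuantumCircuit(qc_ae_training.qubits, qc_ae_training.clbits)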
    # initialize with data record as amplitude encoding
    circ.initialize(x, np.arange(0, latent_space_qubits + trash_space_qubits).tolist())
    circ = circ.compose(qc_ae_training)
    return circ

# create one circuit for each record and initialize it using amplitude encoding -> 1500 circuits
circs = [build_circ(x) for x in train_data]

for dev in ['CPU', 'GPU']:
    s = time()
    sampler = AerSampler(run_options={"method": "statevector", "device": dev})
    job = sampler.run(circs, [param_values] * train_data.shape[0])
    result = job.result()
    duration = time() - s
    print('{} time (s)'.format(dev), duration)

The code above outputs:

CPU time (s) 14.341666221618652
GPU time (s) 20.188242197036743

According to nvidia-smi, the GPU is actually busy for only about 5 seconds of the GPU run.
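As a rough check of these figures, the sketch below works out what share of the GPU run the device was busy and how the two wall times compare. The 5-second busy time is the approximate nvidia-smi reading quoted above, not a value reported by Aer, and the variable names are purely illustrative.

```python
# Wall-clock times reported by the reproduction script, in seconds.
cpu_time = 14.341666221618652
gpu_time = 20.188242197036743

# Approximate GPU-busy time read off nvidia-smi (an estimate, not an Aer measurement).
gpu_busy = 5.0

# Share of the GPU run during which the GPU was actually working.
gpu_busy_fraction = gpu_busy / gpu_time

# How much longer the GPU run took compared with the CPU run.
slowdown = gpu_time / cpu_time

print(f"GPU busy for about {gpu_busy_fraction:.0%} of the GPU run")
print(f"GPU run took {slowdown:.2f}x the CPU run")
```

By this estimate the GPU is busy for roughly a quarter of the run, so most of the 20 seconds is spent outside GPU kernels.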

What is the expected behavior?

The GPU should also accelerate the execution of multiple circuits.

Suggested solutions

Any suggestions on how to optimize multi-circuit execution would be very much appreciated. Thank you.

doichanj commented 1 year ago

GPU optimization for parameterized circuits is implemented in #1901, but we found an issue in AerSampler, so this optimization is currently only available for AerEstimator. A fix for AerSampler will be provided.

doichanj commented 1 year ago

Even with PRs #1901 and #1935 combined, Aer cannot accelerate this example, because it passes only one set of parameter values per circuit. At this time, Aer can only accelerate cases that pass multiple sets of parameter values per circuit.
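To make the distinction concrete, here is a minimal sketch that groups a sampler-style workload by circuit and checks whether any circuit receives more than one set of parameter values. It is not Aer code: the helper names `group_by_circuit` and `has_batchable_circuit` are hypothetical, and circuits are stood in for by plain identifiers. It only illustrates the condition described in the comment above.

```python
from collections import defaultdict


def group_by_circuit(workload):
    """Map each circuit identifier to the list of parameter sets submitted for it.

    `workload` is a list of (circuit_id, parameter_values) pairs, mirroring the
    pairing of circuits and parameter values handed to sampler.run().
    """
    groups = defaultdict(list)
    for circuit_id, parameter_values in workload:
        groups[circuit_id].append(parameter_values)
    return dict(groups)


def has_batchable_circuit(groups):
    """Return True if at least one circuit has more than one parameter set."""
    return any(len(param_sets) > 1 for param_sets in groups.values())


params = [0.1, 0.2, 0.3]  # placeholder parameter values

# Shape of the workload in this issue: 1500 distinct circuits (one per data
# record), each submitted with a single set of parameter values.
issue_workload = [(f"circuit_{i}", params) for i in range(1500)]

# Shape of a workload with multiple parameter sets per circuit: one circuit,
# many different parameter sets.
multi_param_workload = [("ansatz", [0.001 * i, 0.2, 0.3]) for i in range(1500)]

issue_groups = group_by_circuit(issue_workload)
multi_groups = group_by_circuit(multi_param_workload)

print(len(issue_groups), has_batchable_circuit(issue_groups))
print(len(multi_groups), has_batchable_circuit(multi_groups))
```

In the issue's workload every circuit differs because each one embeds its own data record through initialize(), so each group holds exactly one parameter set and nothing is left to batch over.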