Qiskit / qiskit-aer

Aer is a high performance simulator for quantum circuits that includes noise models
https://qiskit.github.io/qiskit-aer/
Apache License 2.0

Expected multi threaded and GPU performance of the different simulators should be documented. #1581

Open dietmarwo opened 2 years ago

dietmarwo commented 2 years ago

It would be helpful if the documentation gave some hints about which simulator option to configure in which context, since the resulting performance is sometimes quite surprising.

Of course this is somewhat CPU/GPU dependent, but I think the results don't differ much across typical modern many-core CPUs.

I ran some tests myself with an inverse Fourier transform on 8 to 36 qubits; see Table 1: Simulation benchmark. The table was produced using the code at https://gist.github.com/dietmarwo/23d30a89018d62c02294525092093671 on Linux Mint 20.3 with a 16-core AMD 5950X CPU and an NVIDIA 1660 Ti GPU. Versions used: {'qiskit-terra': '0.21.1', 'qiskit-aer': '0.10.4', 'qiskit-ignis': '0.7.1', 'qiskit-ibmq-provider': '0.19.2', 'qiskit': '0.37.1', 'qiskit-nature': None, 'qiskit-finance': None, 'qiskit-optimization': None, 'qiskit-machine-learning': None}

Table 1: Simulation benchmark

| simulator | options | time, 8 qubits | time, 12 qubits | time, 18 qubits | time, 24 qubits | time, 30 qubits | time, 36 qubits |
|---|---|---|---|---|---|---|---|
| aer_simulator | none | 0.90 | 2.11 | 3.43 | 28.25 | 1427.3 | 14.14 |
| aer_simulator | max_parallel_threads=1 | 0.91 | 1.82 | 4.28 | 111.28 | 9035.0 | 12.46 |
| aer_simulator | device='GPU' | 0.87 | 1.56 | 3.45 | 19.7 | cuda error | 13.89 |
| qasm_simulator | none | 0.89 | 1.60 | 2.93 | 29.12 | 1434.6 | 14.38 |
| qasm_simulator | max_parallel_threads=1 | 0.90 | 1.60 | 4.09 | 110.66 | 9028.2 | 13.02 |
| qasm_simulator | device='GPU' | 0.87 | 1.56 | 3.14 | 19.83 | cuda error | 14.49 |
| aer_simulator_statevector | none | 0.91 | 1.58 | 3.61 | 28.85 | 1430.8 | - |
| aer_simulator_statevector | max_parallel_threads=1 | 0.89 | 1.6 | 3.88 | 110.4 | 9022.1 | - |
| aer_simulator_statevector | device='GPU' | 0.87 | 1.56 | 2.96 | 19.31 | cuda error | - |
| aer_simulator_density_matrix | none | 0.91 | 10.06 | - | - | - | - |
| aer_simulator_density_matrix | max_parallel_threads=1 | 0.89 | 34.15 | - | - | - | - |
| aer_simulator_density_matrix | device='GPU' | 0.87 | 4.01 | - | - | - | - |
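
For readers who want to reproduce individual rows, here is a minimal sketch of how the three option settings could be mapped onto an `AerSimulator`. It is not the code from the gist: the names `OPTION_SETS`, `METHODS` and `make_backend` are hypothetical, and the import fallback assumes that newer releases expose the simulator as `qiskit_aer` while 0.10.x uses `qiskit.providers.aer`.

```python
# Keyword options matching the "options" column of the benchmark table.
OPTION_SETS = {
    "none": {},
    "max_parallel_threads=1": {"max_parallel_threads": 1},
    "device='GPU'": {"device": "GPU"},
}

# Simulation method behind each backend name used in the table.
METHODS = {
    "aer_simulator": "automatic",
    "aer_simulator_statevector": "statevector",
    "aer_simulator_density_matrix": "density_matrix",
}


def make_backend(name, option_label):
    """Return an AerSimulator configured like one table row (needs qiskit-aer)."""
    try:
        from qiskit_aer import AerSimulator  # newer package layout (assumption)
    except ImportError:
        from qiskit.providers.aer import AerSimulator  # layout used by 0.10.x
    return AerSimulator(method=METHODS[name], **OPTION_SETS[option_label])
```

For example, `make_backend("aer_simulator_statevector", "device='GPU'")` would correspond to the GPU statevector row; it only works with a GPU-enabled build of Aer.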

What I don't understand:

I needed this information to configure a parallel optimization algorithm that uses a Qiskit simulator inside the fitness function. If the simulation scales badly, it is better to run the simulations single-threaded and parallelize the optimization instead. But maybe the simulators can be configured to scale better and I missed something?
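
The scaling concern can be quantified from the table itself. The sketch below computes the speedup of the default (multithreaded) run over the `max_parallel_threads=1` run for the `aer_simulator` rows, and the resulting efficiency per core. It assumes the default run used all 16 cores of the CPU, which the thread does not confirm; the function name is hypothetical.

```python
# Times copied from the aer_simulator rows of the benchmark table.
# Only ratios are used, so the time unit does not matter here.
CORES = 16  # core count of the AMD 5950X stated in the post

timings = {
    # qubits: (default threading, max_parallel_threads=1)
    18: (3.43, 4.28),
    24: (28.25, 111.28),
    30: (1427.3, 9035.0),
}


def speedup_and_efficiency(t_parallel, t_single, cores=CORES):
    """Speedup of the multithreaded run and its efficiency per core."""
    speedup = t_single / t_parallel
    return speedup, speedup / cores


for qubits, (t_par, t_one) in timings.items():
    s, e = speedup_and_efficiency(t_par, t_one)
    print(f"{qubits} qubits: speedup {s:.2f}x, efficiency {e:.0%}")
```

Under that assumption the speedup is roughly 4x at 24 qubits and roughly 6x at 30 qubits, well below 16x. That is the situation in which running several single-threaded simulations side by side may give better overall throughput, provided memory allows it.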

dotslaser commented 2 years ago

Hi! I know this is a bit off topic, but have you been able to run these benchmarks in parallel on multiple GPUs using MPI?

dietmarwo commented 2 years ago

Unfortunately not; I am currently using only one GPU. What I did was specific to my environment. It would be nice if the enhanced documentation were more generic and complete, if possible. It is difficult to predict what a parameter change does, partly because Qiskit's multithreading is done inside the shared library controller_wrappers.cpython-39-x86_64-linux-gnu.so, where users have limited insight. For qasm and aer simulation I don't expect much gain from multiple GPUs anyway.

hhorii commented 2 years ago

Performance depends on the system configuration. Basically, in statevector simulation, the simulation time doubles with each additional qubit. Therefore, a 30-qubit simulation will in general take 64x longer than a 24-qubit one.
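
As a quick check of this rule of thumb against the reported numbers, the following sketch compares the ideal factor of 2 per qubit with the measured 24-qubit and 30-qubit times of the `aer_simulator_statevector` rows. The function name is hypothetical; the timings are taken from the benchmark table in the first post.

```python
def expected_ratio(n_small, n_large):
    """Ideal statevector slowdown when time doubles per added qubit."""
    return 2 ** (n_large - n_small)


# Measured aer_simulator_statevector times (24 qubits, 30 qubits).
measured = {
    "none": (28.85, 1430.8),
    "max_parallel_threads=1": (110.4, 9022.1),
}

print("expected ratio:", expected_ratio(24, 30))  # 2**6
for label, (t24, t30) in measured.items():
    print(f"{label}: measured ratio {t30 / t24:.1f}")
```

The measured ratios come out at about 50x for the default run and about 82x for the single-threaded run, so they bracket the ideal 64x rather than matching it exactly.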

A GPU has initialization overhead, so it is not effective for small numbers of qubits. The computational cost of a 12-qubit density matrix simulation is the same as that of a 24-qubit statevector simulation, so a GPU can work well for a 12-qubit density matrix.

I guess the 36-qubit simulation does not work well.
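
The size argument behind these remarks can be sketched with a few lines of arithmetic. The example assumes double precision, i.e. 16 bytes per complex amplitude, and counts only the state itself, not any working buffers; the function names are hypothetical.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (double precision)
GIB = 2 ** 30


def statevector_bytes(n_qubits):
    """Memory for a full statevector: 2**n amplitudes."""
    return (2 ** n_qubits) * BYTES_PER_AMPLITUDE


def density_matrix_bytes(n_qubits):
    """Memory for a full density matrix: 2**n x 2**n entries."""
    return (4 ** n_qubits) * BYTES_PER_AMPLITUDE


# A 12-qubit density matrix is as large as a 24-qubit statevector.
assert density_matrix_bytes(12) == statevector_bytes(24)

for n in (24, 30, 36):
    print(f"{n}-qubit statevector: {statevector_bytes(n) / GIB:g} GiB")
```

Under this assumption a 30-qubit statevector needs 16 GiB and a 36-qubit statevector 1024 GiB. The former would not fit on a GPU with only a few gigabytes of memory, which would be consistent with the reported cuda error, although the thread does not confirm the cause. The latter exceeds the memory of a typical workstation.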

Finally, QFT is a typical workload, but it is better to benchmark more applications. We will show some documentation for performance in near future.

dietmarwo commented 2 years ago

Thanks for the information.

> We will show some documentation for performance in near future.

Looking forward to that. At https://github.com/dietmarwo/fast-cma-es/blob/master/tutorials/Quant.adoc#vqe-variational-quantum-eigensolver I wrote something about configuring the parallelization of VQE optimization. Good scaling cannot be achieved using Qiskit's own optimizers, but alternatives are available.