Expected multi threaded and GPU performance of the different simulators should be documented.

dietmarwo commented 2 years ago

It would be helpful to have some hints about which simulator option to configure in which context, since the resulting performance is sometimes quite surprising.

Of course this is somewhat CPU / GPU dependent, but I think the results don't differ much across typical modern many-core CPUs.

I ran some tests myself for an 8-36 qubit inverse Fourier transform, see Table 1: Simulation benchmark. The table was produced using the code at https://gist.github.com/dietmarwo/23d30a89018d62c02294525092093671 on Linux Mint 20.3 / 16 core AMD 5950x CPU / NVIDIA 1660TI GPU. Versions used: {'qiskit-terra': '0.21.1', 'qiskit-aer': '0.10.4', 'qiskit-ignis': '0.7.1', 'qiskit-ibmq-provider': '0.19.2', 'qiskit': '0.37.1', 'qiskit-nature': None, 'qiskit-finance': None, 'qiskit-optimization': None, 'qiskit-machine-learning': None}

simulator	options	time 8 qubits	time 12 qubits	time 18 qubits	time 24 qubits	time 30 qubits	time 36 qubits
aer_simulator	none	0.90	2.11	3.43	28.25	1427.3	14.14
aer_simulator	max_parallel_threads=1	0.91	1.82	4.28	111.28	9035.0	12.46
aer_simulator	device='GPU'	0.87	1.56	3.45	19.7	cuda error	13.89
qasm_simulator	none	0.89	1.60	2.93	29.12	1434.6	14.38
qasm_simulator	max_parallel_threads=1	0.90	1.60	4.09	110.66	9028.2	13.02
qasm_simulator	device='GPU'	0.87	1.56	3.14	19.83	cuda error	14.49
aer_simulator_statevector	none	0.91	1.58	3.61	28.85	1430.8	-
aer_simulator_statevector	max_parallel_threads=1	0.89	1.6	3.88	110.4	9022.1	-
aer_simulator_statevector	device='GPU'	0.87	1.56	2.96	19.31	cuda error	-
aer_simulator_density_matrix	none	0.91	10.06	-	-	-	-
aer_simulator_density_matrix	max_parallel_threads=1	0.89	34.15	-	-	-	-
aer_simulator_density_matrix	device='GPU'	0.87	4.01	-	-	-	-

What I don't understand:

Why is it faster for 36 qubits than for 24 qubits?
Why is there no GPU scaling for <= 18 qubits, except for aer_simulator_density_matrix?
Why does the time grow so fast for 24 and 30 qubits?

I needed this information to configure a parallel optimization algorithm that uses a Qiskit simulator inside the fitness function. Bad simulation scaling means it is better to execute the simulations single threaded and parallelize the optimization instead. But maybe the simulators can be configured to scale better and I missed something?
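
To make this trade-off concrete, here is a rough sketch that only reuses the aer_simulator timings from the table above. The function names are made up for illustration, and the throughput estimate is idealized: it assumes that 16 independent single-threaded simulations run side by side without slowing each other down and that there is enough memory for all of them, which may not hold for larger qubit counts.

```python
# Timings copied from the aer_simulator rows of the table above
# (same units as reported there; only ratios are used here).
timings = {
    18: {"default": 3.43, "single_thread": 4.28},
    24: {"default": 28.25, "single_thread": 111.28},
    30: {"default": 1427.3, "single_thread": 9035.0},
}

CORES = 16  # the AMD 5950X used for the benchmark


def thread_speedup(row):
    """Speedup of the default (multi-threaded) run over max_parallel_threads=1."""
    return row["single_thread"] / row["default"]


def idealized_throughput_gain(row, cores=CORES):
    """Simulations per unit time when running `cores` single-threaded
    simulations at once, relative to one multi-threaded simulation at a time.
    Assumption: the concurrent runs do not interfere with each other."""
    return cores / thread_speedup(row)


for qubits, row in timings.items():
    print(
        f"{qubits} qubits: thread speedup {thread_speedup(row):.2f}x, "
        f"idealized gain from parallel single-threaded runs "
        f"{idealized_throughput_gain(row):.2f}x"
    )
```

Under these assumptions the multi-threaded run at 24 qubits is only about 4x faster than the single-threaded one on 16 cores, which is why running many single-threaded simulations in parallel looks attractive for the optimizer.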

dotslaser commented 2 years ago

Hi! I know this is a bit off topic, but have you been able to run these benchmarks in parallel on multiple GPUs using MPI?

dietmarwo commented 2 years ago

Unfortunately not, currently I am using only one GPU. What I did was specific to my environment. It would be nice if the enhanced documentation were more generic and complete, if possible. It is difficult to predict what a parameter change does, partly because Qiskit's multithreading is done inside the shared library controller_wrappers.cpython-39-x86_64-linux-gnu.so, where users have limited insight. In any case, I don't expect much gain from multiple GPUs for qasm and aer simulation.

hhorii commented 2 years ago

Performance depends on system configuration. Basically, in statevector simulation, the simulation time doubles for each additional qubit. Therefore, a 30-qubit simulation will in general take 64x longer than a 24-qubit one.
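
The doubling rule can be sketched in a few lines. The helper names are hypothetical, and the memory estimate assumes a dense statevector with 16 bytes per amplitude (double-precision complex), which is an assumption and not something measured in this thread.

```python
def relative_cost(from_qubits, to_qubits):
    """Expected runtime ratio if the cost doubles with every added qubit."""
    return 2 ** (to_qubits - from_qubits)


def statevector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory for a dense statevector of 2**n amplitudes.
    Assumption: 16 bytes per amplitude (complex128)."""
    return (2 ** num_qubits) * bytes_per_amplitude


print(relative_cost(24, 30))  # 2**6 = 64

for n in (24, 30, 36):
    gib = statevector_bytes(n) / 2 ** 30
    print(f"{n} qubits: {gib:g} GiB")
```

For comparison, the measured ratio between 30 and 24 qubits in the table is roughly 50x for the default runs and roughly 81x for the single-threaded runs, so it is in the same range as the expected 64x. Under the 16-byte assumption, a dense 36-qubit statevector would need 1024 GiB.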

The GPU has an initialization overhead, so it is not effective for small qubit counts. The computational cost of a 12-qubit density matrix simulation is the same as that of a 24-qubit statevector simulation, so the GPU can work well for a 12-qubit density matrix.
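
The equivalence between density matrix and statevector sizes follows from counting elements, as in this small sketch. The function names are illustrative, and element count is used here only as a rough proxy for computational cost.

```python
def statevector_amplitudes(num_qubits):
    """A statevector on n qubits holds 2**n complex amplitudes."""
    return 2 ** num_qubits


def density_matrix_elements(num_qubits):
    """A density matrix on n qubits is a 2**n x 2**n matrix,
    so it holds 4**n complex elements."""
    dim = 2 ** num_qubits
    return dim * dim


# A 12-qubit density matrix is as large as a 24-qubit statevector.
print(density_matrix_elements(12) == statevector_amplitudes(24))
```

In general an n-qubit density matrix has as many elements as a 2n-qubit statevector, which matches the table: the 12-qubit density matrix rows show a GPU benefit similar to the 24-qubit statevector rows.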

I guess the 36-qubit simulations did not run properly.

Finally, QFT is a typical workload, but it is better to benchmark more applications as well. We will provide some documentation on performance in the near future.

dietmarwo commented 2 years ago

Thanks for the information.

We will show some documentation for performance in near future.

Looking forward to that. At https://github.com/dietmarwo/fast-cma-es/blob/master/tutorials/Quant.adoc#vqe-variational-quantum-eigensolver I wrote something about configuring the parallelization of VQE optimization. Good scaling cannot be achieved using Qiskit's own optimizers, but there are alternatives available.

Qiskit / qiskit-aer

Expected multi threaded and GPU performance of the different simulators should be documented. #1581