Open dietmarwo opened 2 years ago
Hi! I know this is a bit out of topic, but have you been able to do these benchmarks in parallel using multiple GPUs with MPI protocol?
Unfortunately not, currently I am using only one GPU. What I did was specific to my environment. Would be nice if the enhanced documentation would be more generic / complete if possible. It is difficult to predict what a parameter change does partly because qiskits multithreading is done inside the shared library controller_wrappers.cpython-39-x86_64-linux-gnu.so where users have limited insight. For qasm and aer simulation I anyway don't expect much gain from multiple GPUs.
Performance depends on system configuration. Basically, in statevector simulation, simulation time will be 2x longer if 1 qubit is increased. Therefore, 30 qubits simulation will be 64x longer than 24 qubits in general.
GPU has overhead for its initialization. Therefore, for few qubits, GPU is not effective. Computation cost of 12 qubits of density matrix is same with 24 qubits of statevector. GPU can work well for 12qubit density matrix.
I guess 36-qubits simulation do not work well.
Finally, QFT is a typical workload but it is better to use more application. We will show some documentation for performance in near future.
Thanks for the information.
We will show some documentation for performance in near future.
Looking forward to that. At https://github.com/dietmarwo/fast-cma-es/blob/master/tutorials/Quant.adoc#vqe-variational-quantum-eigensolver I wrote something about configuring parallelization of optimization of VQEs. Good scaling cannot be achieved using qiskits own optimizers. But there are alternatives available.
It would be helpful if there would be some hint about in which context to configure what simulator option, since the resulting performance is sometimes quite "surprizing".
Of course this is somehow CPU / GPU dependent, but I think the results don't differ much for typical modern many core CPUs.
Did some tests myself for a 8-36 qubit inverse fourier transform, see Simulation benchmark Table 1: Simulation benchmark . The table was produced using the code at https://gist.github.com/dietmarwo/23d30a89018d62c02294525092093671 on Linux Mint 20.3 / 16 core AMD 5950x CPU / NVIDIA 1660TI GPU. Used version: {'qiskit-terra': '0.21.1', 'qiskit-aer': '0.10.4', 'qiskit-ignis': '0.7.1', 'qiskit-ibmq-provider': '0.19.2', 'qiskit': '0.37.1', 'qiskit-nature': None, 'qiskit-finance': None, 'qiskit-optimization': None, 'qiskit-machine-learning': None}
What I don't understand:
I needed this information for the configuration of a parallel optimization algorithm using a quiskit simulator inside the fitness function. Bad simulation scaling means it is better to execute them single threaded and use optimization parallelization instead. But may be the simulators can be configured to scale better and I missed something?