Closed rsln-s closed 3 years ago
Curiously, on a different machine (Intel Xeon E5-2695v4) the simulator uses all cores regardless of the values of max_parallel_experiments or max_parallel_threads. A slightly more complicated example that uses deeper circuits crashes with a segmentation fault despite only using ~20% of the available 126 GB of memory at peak (see #1289). Repeating the example that uses all 36 cores and segfaults despite setting max_parallel_experiments=2, max_parallel_threads=2:
import numpy as np
from qiskit.circuit.library import ZZFeatureMap
from qiskit.providers.aer import StatevectorSimulator
from qiskit.utils import QuantumInstance
from qiskit_machine_learning.kernels import QuantumKernel
from qiskit_machine_learning.datasets import digits
digits_dimension = 20
train_features, train_labels, test_features, test_labels = digits(
training_size=100,
test_size=20,
n=digits_dimension
)
digits_feature_map = ZZFeatureMap(feature_dimension=digits_dimension,
reps=2, entanglement='linear')
digits_backend = QuantumInstance(StatevectorSimulator(max_parallel_experiments=2, max_parallel_threads=2))
digits_kernel = QuantumKernel(feature_map=digits_feature_map, quantum_instance=digits_backend)
digits_matrix_train = digits_kernel.evaluate(x_vec=train_features)
print("Done computing train matrix")
digits_matrix_test = digits_kernel.evaluate(x_vec=test_features,
y_vec=train_features)
print("Done computing test matrix")
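For a sense of scale: the kernel-matrix shapes above already imply thousands of circuits, assuming QuantumKernel.evaluate builds one circuit per required kernel entry (only the off-diagonal upper triangle for the symmetric train matrix). A quick sanity check of the counts:

```python
# Circuit counts implied by the kernel-matrix shapes above, assuming one
# circuit per kernel entry (upper triangle only for the symmetric case).

def symmetric_kernel_circuits(n_samples: int) -> int:
    """Off-diagonal upper-triangle entries of an n x n symmetric kernel."""
    return n_samples * (n_samples - 1) // 2

def rectangular_kernel_circuits(n_x: int, n_y: int) -> int:
    """All entries of an n_x by n_y cross-kernel (test vs. train)."""
    return n_x * n_y

train_circuits = symmetric_kernel_circuits(100)       # 100 training samples
test_circuits = rectangular_kernel_circuits(20, 100)  # 20 test vs. 100 train

print(train_circuits)  # 4950 -- the "5k circuits" figure in the timings below
print(test_circuits)   # 2000
```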
StatevectorSimulator does not ignore max_parallel_experiments
. I believe that your code is struggling before simulation.
Multiple circuits are transferred from Python to C++, and then the simulation runs. If the number of circuits is large, transpilation (Python), assembly to qobj (Python), and deserialization of the qobj (C++) become the bottleneck before simulation.
I'm not sure that 10,000 circuits is a realistic number for a single request. However, I believe that #1266 will help in the near future.
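One possible mitigation in the meantime is to submit the circuits in moderately sized batches rather than one huge job, so the Python-side assembly cost is amortized per batch. A minimal sketch of the batching logic — the `backend.run` usage in the comment is illustrative glue, not the exact qiskit-machine-learning internals:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def chunked(items: Sequence[T], size: int) -> List[Sequence[T]]:
    """Split a sequence into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical usage with a backend (sketch only):
#
# results = []
# for batch in chunked(all_circuits, 500):
#     results.append(backend.run(batch).result())

batches = chunked(list(range(10_000)), 500)
print(len(batches))      # 20 batches
print(len(batches[-1]))  # last batch holds 500 items
```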
@hhorii just to confirm, you are correct, the flag is not ignored in the original example on macOS. The first code gives the following performance (all times in seconds):
Using QuantumInstance(StatevectorSimulator(max_parallel_experiments=nthreads, max_parallel_threads=nthreads)), 5k circuits
% for i in $(seq 1 3 16); do python test.py $i; done
Done computing in 369.03213477134705 using 1 threads
Done computing in 263.2407970428467 using 4 threads
Done computing in 255.8174307346344 using 7 threads
Done computing in 258.7534546852112 using 10 threads
Done computing in 257.34792590141296 using 13 threads
Done computing in 258.60648226737976 using 16 threads
and
Using AerSimulator(method="statevector", max_parallel_experiments=nthreads, max_parallel_threads=nthreads), 10k circuits
% for i in $(seq 1 3 16); do python test.py $i; done
Done computing in 399.38044691085815 using 1 threads
Done computing in 232.18665313720703 using 4 threads
Done computing in 227.30876183509827 using 7 threads
Done computing in 228.06856298446655 using 10 threads
On Linux though, the simulator does appear to use all cores regardless of the passed flags, but that is a problem for another issue.
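The plateau around 1.4-1.8x speedup in both tables is consistent with a large serial fraction (the Python-side transpilation and assembly mentioned above). A back-of-the-envelope Amdahl's-law fit from the 10k-circuit timings — purely illustrative arithmetic, not a measurement:

```python
# Infer the serial fraction s from Amdahl's law, T(n) = T1 * (s + (1 - s) / n),
# using the 1-thread and 4-thread timings from the AerSimulator run above.
t1, t4 = 399.38, 232.19
n = 4

# Solve t4 / t1 = s + (1 - s) / n for s.
s = (t4 / t1 - 1 / n) / (1 - 1 / n)
max_speedup = 1 / s  # asymptotic speedup as n -> infinity

print(f"serial fraction ~ {s:.2f}")                  # roughly 0.44
print(f"best possible speedup ~ {max_speedup:.1f}x") # roughly 2.3x
```

A serial fraction near 44% would cap the achievable speedup at about 2.3x no matter how many threads are added, matching the flat timings from 4 threads onward.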
Information
What is the current behavior?
Even when the max_parallel_experiments flag is passed, the StatevectorSimulator only uses one core when asked to execute a large number of circuits, as can easily be verified using htop.
Steps to reproduce the problem
The following example computes the statevectors of 10,000 10-qubit states composed of Haar-random single qubit states.
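As a purely illustrative sketch of the kind of input states described (the original snippet is not reproduced here), one can draw Haar-random single-qubit states with numpy — a normalized complex Gaussian vector is Haar-distributed — and tensor ten of them together:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def haar_random_single_qubit_state(rng) -> np.ndarray:
    """Haar-random single-qubit state: a normalized complex Gaussian 2-vector."""
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    return v / np.linalg.norm(v)

# A 10-qubit product state built from independent Haar-random qubit states.
state = haar_random_single_qubit_state(rng)
for _ in range(9):
    state = np.kron(state, haar_random_single_qubit_state(rng))

print(state.shape)  # (1024,) -- a 2**10-dimensional statevector
print(np.isclose(np.linalg.norm(state), 1.0))  # True: the state is normalized
```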
Same behavior occurs when using the StatevectorSimulator backend directly.
What is the expected behavior?
Multiple cores should be used, as specified by the max_parallel_experiments flag.
Suggested solutions
N/A