Qiskit / qiskit-aer

Aer is a high performance simulator for quantum circuits that includes noise models
https://qiskit.github.io/qiskit-aer/
Apache License 2.0

Bug in circuit distribution using MPI on multiple GPUs #1583

Closed dotslaser closed 1 year ago

dotslaser commented 2 years ago

Information

What is the current behavior?

Segmentation fault when running a quantum circuit on GPUs with multiple processes (using MPI). I found a partial (but annoying) workaround, shown in the tests below.

Steps to reproduce the problem

I'm using NVIDIA GPUs in AWS (Amazon Web Services) instances. These are the system specs:

AWS specs: g5.xlarge instances

NVIDIA CUDA specifications:

[screenshot of the NVIDIA CUDA specifications]

NVIDIA CUDA installation steps:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Qiskit Aer Compilation:

python ./setup.py bdist_wheel -- -DAER_MPI=True -DAER_THRUST_BACKEND=CUDA

Simple Quantum Circuit for tests:

This circuit has only Hadamard gates.

Test 1 (2 processes):

Code:

from qiskit import execute, QuantumCircuit
from qiskit.providers.aer import AerSimulator
qubits = 24 #number of qubits
blocking_qubits = 23 # 24-1 blocking qubits
sim = AerSimulator(method="statevector", device="GPU")
circ = QuantumCircuit(qubits)
for i in range(qubits):
    circ.h(i)  # simple circuit with one Hadamard gate on each qubit
circ.measure_all()
result = execute(circ, sim, shots=10, blocking_enable=True, blocking_qubits=blocking_qubits).result()
print(result)

Execution:

mpirun -np 2 -host  172.31.43.188:2 python /home/ubuntu/example_circuit.py 

Result:

[ip-172-31-43-188:04164] Read -1, expected 67108864, errno = 14
[ip-172-31-43-188:04164]  Process received signal 
[ip-172-31-43-188:04164] Signal: Segmentation fault (11)
[ip-172-31-43-188:04164] Signal code: Invalid permissions (2)
[ip-172-31-43-188:04164] Failing at address: 0x7fbcac000000
[ip-172-31-43-188:04165] Read -1, expected 67108864, errno = 14
[ip-172-31-43-188:04165]  Process received signal 
[ip-172-31-43-188:04165] Signal: Segmentation fault (11)
[ip-172-31-43-188:04165] Signal code: Invalid permissions (2)
[ip-172-31-43-188:04165] Failing at address: 0x7f1ff8000000
[ip-172-31-43-188:04164] [ 0] [ip-172-31-43-188:04165] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fbd16244090]
[ip-172-31-43-188:04164] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18b8f5)[0x7fbd1638c8f5]
[ip-172-31-43-188:04164] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2063a94090]
[ip-172-31-43-188:04165] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18b8f5)[0x7f2063bdc8f5]
[ip-172-31-43-188:04165] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x31c4)[0x7f20516cf1c4]
[ip-172-31-43-188:04165] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1c6)[0x7f20516b1926]
[ip-172-31-43-188:04165] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x1a9)[0x7f20516aa429]
[ip-172-31-43-188:04165] [ 5] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x31c4)[0x7fbd0427f1c4]
[ip-172-31-43-188:04164] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1c6)[0x7fbd04261926]
[ip-172-31-43-188:04164] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x1a9)[0x7fbd0425a429]
[ip-172-31-43-188:04164] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7fbd04280ed5]
[ip-172-31-43-188:04164] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x53a3)[0x7fbd042813a3]
[ip-172-31-43-188:04164] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7f20516d0ed5]
[ip-172-31-43-188:04165] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x53a3)[0x7f20516d13a3]
[ip-172-31-43-188:04165] [ 7] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7f205526a854]
[ip-172-31-43-188:04165] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fbd07a1a854]
[ip-172-31-43-188:04164] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7fbd07a21315]
[ip-172-31-43-188:04164] [ 9] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f2055271315]
[ip-172-31-43-188:04165] [ 9] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x228)[0x7f20557019f8]
[ip-172-31-43-188:04165] [10] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x228)[0x7fbd07eb19f8]
[ip-172-31-43-188:04164] [10] /lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Wait+0x58)[0x7f2055744a88]
[ip-172-31-43-188:04165] [11] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x271e89)[0x7f2057bd7e89]
[ip-172-31-43-188:04165] [12] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x27019d)[0x7f2057bd619d]
[ip-172-31-43-188:04165] [13] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0xe4bb9)[0x7f2057a4abb9]
[ip-172-31-43-188:04165] [14] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x4343dc)[0x7f2057d9a3dc]
[ip-172-31-43-188:04165] [15] /lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Wait+0x58)[0x7fbd07ef4a88]
[ip-172-31-43-188:04164] [11] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x271e89)[0x7fbd0a387e89]
[ip-172-31-43-188:04164] [12] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x4362c4)[0x7f2057d9c2c4]
[ip-172-31-43-188:04165] [16] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x436ce9)[0x7f2057d9cce9]
[ip-172-31-43-188:04165] [17] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0xe5cc1)[0x7f2057a4bcc1]
[ip-172-31-43-188:04165] [18] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x43dbce)[0x7f2057da3bce]
[ip-172-31-43-188:04165] [19] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x43f420)[0x7f2057da5420]
[ip-172-31-43-188:04165] [20] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x43f6f2)[0x7f2057da56f2]
[ip-172-31-43-188:04165] [21] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x27019d)[0x7fbd0a38619d]
[ip-172-31-43-188:04164] [13] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0xe4bb9)[0x7fbd0a1fabb9]
[ip-172-31-43-188:04164] [14] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x4343dc)[0x7fbd0a54a3dc]
[ip-172-31-43-188:04164] [15] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x4362c4)[0x7fbd0a54c2c4]
[ip-172-31-43-188:04164] [16] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x436ce9)[0x7fbd0a54cce9]
[ip-172-31-43-188:04164] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x1a280d)[0x7f2057b0880d]
[ip-172-31-43-188:04165] [22] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x1a2f44)[0x7f2057b08f44]
[ip-172-31-43-188:04165] [23] python(PyCFunction_Call+0x59)[0x5f6929]
[ip-172-31-43-188:04165] [24] [17] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0xe5cc1)[0x7fbd0a1fbcc1]
[ip-172-31-43-188:04164] [18] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x43dbce)[0x7fbd0a553bce]
[ip-172-31-43-188:04164] [19] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x43f420)[0x7fbd0a555420]
[ip-172-31-43-188:04164] [20] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x43f6f2)[0x7fbd0a5556f2]
[ip-172-31-43-188:04164] [21] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x1a280d)[0x7fbd0a2b880d]
[ip-172-31-43-188:04164] [22] /home/ubuntu/.local/lib/python3.8/site-packages/qiskit/providers/aer/backends/controller_wrappers.cpython-38-x86_64-linux-gnu.so(+0x1a2f44)[0x7fbd0a2b8f44]
[ip-172-31-43-188:04164] [23] python(_PyObject_MakeTpCall+0x296)[0x5f74f6]
[ip-172-31-43-188:04165] [25] python(PyCFunction_Call+0x59)[0x5f6929]
[ip-172-31-43-188:04164] [24] python(_PyObject_MakeTpCall+0x296)[0x5f74f6]
[ip-172-31-43-188:04164] [25] python[0x50c358]
[ip-172-31-43-188:04165] [26] python(PyObject_Call+0x62)[0x5f6082]
[ip-172-31-43-188:04165] [27] python[0x59dbac]
[ip-172-31-43-188:04165] [28] python(_PyObject_MakeTpCall+0x296)[0x5f74f6]
[ip-172-31-43-188:04165] [29] python(_PyEval_EvalFrameDefault+0x59b5)[0x570d55]
[ip-172-31-43-188:04165]  End of error message 
python[0x50c358]
[ip-172-31-43-188:04164] [26] python(PyObject_Call+0x62)[0x5f6082]
[ip-172-31-43-188:04164] [27] python[0x59dbac]
[ip-172-31-43-188:04164] [28] python(_PyObject_MakeTpCall+0x296)[0x5f74f6]
[ip-172-31-43-188:04164] [29] python(_PyEval_EvalFrameDefault+0x59b5)[0x570d55]
[ip-172-31-43-188:04164]  End of error message 
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 4165 on node 172.31.43.188 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
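
The `errno = 14` reported at the top of each trace can be decoded with Python's standard library. The snippet below is a minimal sketch; the helper name `describe_errno` is illustrative and not part of Qiskit or Open MPI. On Linux, code 14 is `EFAULT` (bad address). This suggests, without proving it, that the shared-memory transport (`mca_btl_vader` in the trace) was handed a buffer address it could not read from host code.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# errno value copied from the "Read -1, expected 67108864, errno = 14" line
print(describe_errno(14))
```

The same "Bad address" wording appears in the `writev` errors of the multi-node runs, so both transports seem to fail for a related reason.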

If I execute the exact same code on 2 machines (same AWS instance type, one process on each machine instead of 2 on one machine):

Execution:

mpirun -np 2 -host  172.31.43.188,172.31.43.59 python /home/ubuntu/example_circuit.py 

Result:

[ip-172-31-43-59][[62324,1],1][btl_tcp_frag.c:128:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x1d7d718, 8)
    Bad address(3)

[ip-172-31-43-188][[62324,1],0][btl_tcp_frag.c:128:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x2ebeb98, 8)
    Bad address(3)

Test 2 (same number of qubits and same circuit as the previous case, except now the last qubit does not have a Hadamard gate):

Code:

from qiskit import execute, QuantumCircuit
from qiskit.providers.aer import AerSimulator
qubits = 24
blank_qubits = 1 # 2^1 processes
blocking_qubits = 23 # 24-1 blocking qubits
sim = AerSimulator(method="statevector", device="GPU")
circ = QuantumCircuit(qubits)
for i in range(qubits-blank_qubits):
    circ.h(i)  # now only the first 23 qubits have a Hadamard gate; the last one has no operation
circ.measure_all()
result = execute(circ, sim, shots=10, blocking_enable=True, blocking_qubits=blocking_qubits).result()
print(result)

Execution (same command as before):

mpirun -np 2 -host  172.31.43.188,172.31.43.59 python /home/ubuntu/example_circuit.py 

Result:

Result(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0', qobj_id='a119c73e-1183-4dec-bd18-17fac2435cbd', job_id='2c9d99d9-5810-4bad-8401-303cf6ecaa68', success=True, results=[ExperimentResult(shots=10, success=True, meas_level=2, data=ExperimentResultData(counts={'0x4fba2': 1, '0x35929e': 1, '0x189cd5': 1, '0x42f933': 1, '0x4a11c7': 1, '0x90f95': 1, '0x3ce730': 1, '0x34c84': 1, '0x29647': 1, '0x706d3': 1}), header=QobjExperimentHeader(clbit_labels=[['meas', 0], ['meas', 1], ['meas', 2], ['meas', 3], ['meas', 4], ['meas', 5], ['meas', 6], ['meas', 7], ['meas', 8], ['meas', 9], ['meas', 10], ['meas', 11], ['meas', 12], ['meas', 13], ['meas', 14], ['meas', 15], ['meas', 16], ['meas', 17], ['meas', 18], ['meas', 19], ['meas', 20], ['meas', 21], ['meas', 22], ['meas', 23]], creg_sizes=[['meas', 24]], global_phase=0.0, memory_slots=24, metadata={}, n_qubits=24, name='circuit-80', qreg_sizes=[['q', 24]], qubit_labels=[['q', 0], ['q', 1], ['q', 2], ['q', 3], ['q', 4], ['q', 5], ['q', 6], ['q', 7], ['q', 8], ['q', 9], ['q', 10], ['q', 11], ['q', 12], ['q', 13], ['q', 14], ['q', 15], ['q', 16], ['q', 17], ['q', 18], ['q', 19], ['q', 20], ['q', 21], ['q', 22], ['q', 23]]), status=DONE, seed_simulator=3532817569, metadata={'noise': 'ideal', 'batched_shots_optimization': False, 'measure_sampling': True, 'parallel_shots': 1, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'num_clbits': 24, 'parallel_state_update': 2, 'sample_measure_time': 0.001236201, 'num_qubits': 24, 'device': 'GPU', 'input_qubit_map': [[23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'cacheblocking': {'max_multiple_chunk_swaps': 8, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': 
True, 'chunk_parallel_gpus': 1, 'block_bits': 23, 'enabled': True}, 'fusion': {'applied': True, 'time_taken': 0.000301452, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}}, time_taken=0.019187484)], date=2022-08-25T08:25:27.059144, status=COMPLETED, header=QobjHeader(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0'), metadata={'time_taken': 0.019434877, 'time_taken_execute': 0.019201394, 'mpi_rank': 0, 'num_mpi_processes': 2, 'max_gpu_memory_mb': 22586, 'max_memory_mb': 15815, 'parallel_experiments': 1, 'time_taken_load_qobj': 0.000226312, 'num_processes_per_experiments': 2, 'omp_enabled': True}, time_taken=0.022274255752563477)
Result(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0', qobj_id='b7c5e9ba-78f3-485b-ade0-c9f76c1dadde', job_id='f297b438-1843-47fb-88ee-bfa5f77fc4cc', success=True, results=[ExperimentResult(shots=10, success=True, meas_level=2, data=ExperimentResultData(counts={'0x4fba2': 1, '0x35929e': 1, '0x189cd5': 1, '0x42f933': 1, '0x4a11c7': 1, '0x90f95': 1, '0x3ce730': 1, '0x34c84': 1, '0x29647': 1, '0x706d3': 1}), header=QobjExperimentHeader(clbit_labels=[['meas', 0], ['meas', 1], ['meas', 2], ['meas', 3], ['meas', 4], ['meas', 5], ['meas', 6], ['meas', 7], ['meas', 8], ['meas', 9], ['meas', 10], ['meas', 11], ['meas', 12], ['meas', 13], ['meas', 14], ['meas', 15], ['meas', 16], ['meas', 17], ['meas', 18], ['meas', 19], ['meas', 20], ['meas', 21], ['meas', 22], ['meas', 23]], creg_sizes=[['meas', 24]], global_phase=0.0, memory_slots=24, metadata={}, n_qubits=24, name='circuit-80', qreg_sizes=[['q', 24]], qubit_labels=[['q', 0], ['q', 1], ['q', 2], ['q', 3], ['q', 4], ['q', 5], ['q', 6], ['q', 7], ['q', 8], ['q', 9], ['q', 10], ['q', 11], ['q', 12], ['q', 13], ['q', 14], ['q', 15], ['q', 16], ['q', 17], ['q', 18], ['q', 19], ['q', 20], ['q', 21], ['q', 22], ['q', 23]]), status=DONE, seed_simulator=1557309410, metadata={'noise': 'ideal', 'batched_shots_optimization': False, 'measure_sampling': True, 'parallel_shots': 1, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'num_clbits': 24, 'parallel_state_update': 2, 'sample_measure_time': 0.00138256, 'num_qubits': 24, 'device': 'GPU', 'input_qubit_map': [[23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'cacheblocking': {'max_multiple_chunk_swaps': 8, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': 
True, 'chunk_parallel_gpus': 1, 'block_bits': 23, 'enabled': True}, 'fusion': {'applied': True, 'time_taken': 0.00029167, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}}, time_taken=0.019317088)], date=2022-08-25T08:25:27.066295, status=COMPLETED, header=QobjHeader(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0'), metadata={'time_taken': 0.019580018, 'time_taken_execute': 0.019331928, 'mpi_rank': 1, 'num_mpi_processes': 2, 'max_gpu_memory_mb': 22586, 'max_memory_mb': 15815, 'parallel_experiments': 1, 'time_taken_load_qobj': 0.00024069, 'num_processes_per_experiments': 2, 'omp_enabled': True}, time_taken=0.020051240921020508)

Now it gives the expected result (output of 2 processes). It also works if both processes run on one instance instead of two.

Test 3 (now 4 processes, all qubits with Hadamard gate):

Code:

from qiskit import execute, QuantumCircuit
from qiskit.providers.aer import AerSimulator
qubits = 24
blocking_qubits = 22 # 24-2 blocking qubits
sim = AerSimulator(method="statevector", device="GPU")
circ = QuantumCircuit(qubits)
for i in range(qubits):
    circ.h(i)
circ.measure_all()
result = execute(circ, sim, shots=10, blocking_enable=True, blocking_qubits=blocking_qubits).result()
print(result)

Execution:

mpirun -np 4 -host  172.31.43.188:2,172.31.43.59:2 python /home/ubuntu/example_circuit.py

Result:

[ip-172-31-43-188][[62138,1],1][btl_tcp_frag.c:128:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x1ae9218, 8)
    Bad address(3)

[ip-172-31-43-188][[62138,1],0][btl_tcp_frag.c:128:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x34ca298, 8)
    Bad address(3)

[ip-172-31-43-59][[62138,1],3][btl_tcp_frag.c:128:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x2f31a98, 8)
    Bad address(3)

[ip-172-31-43-59][[62138,1],2][btl_tcp_frag.c:128:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x1d04a98, 8)
    Bad address(3)

Test 4 (4 processes, all qubits except the last with Hadamard gate):

Code:

from qiskit import execute, QuantumCircuit
from qiskit.providers.aer import AerSimulator
qubits = 24
blank_qubits = 1
blocking_qubits = 22 # 24-2 blocking qubits
sim = AerSimulator(method="statevector", device="GPU")
circ = QuantumCircuit(qubits)
for i in range(qubits-blank_qubits):
    circ.h(i)
circ.measure_all()
result = execute(circ, sim, shots=10, blocking_enable=True, blocking_qubits=blocking_qubits).result()
print(result)

Execution:

mpirun -np 4 -host  172.31.43.188:2,172.31.43.59:2 python /home/ubuntu/example_circuit.py

Result: Segmentation fault (similar error to previous cases, like in test 1)

Test 5 (4 processes, all qubits except the last 2 with Hadamard gate):

Code:

from qiskit import execute, QuantumCircuit
from qiskit.providers.aer import AerSimulator
qubits = 24
blank_qubits = 2
blocking_qubits = 22 # 24-2 blocking qubits
sim = AerSimulator(method="statevector", device="GPU")
circ = QuantumCircuit(qubits)
for i in range(qubits-blank_qubits):
    circ.h(i)
circ.measure_all()
result = execute(circ, sim, shots=10, blocking_enable=True, blocking_qubits=blocking_qubits).result()
print(result)

Execution:

mpirun -np 4 -host  172.31.43.188:2,172.31.43.59:2 python /home/ubuntu/example_circuit.py

Result:

Result(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0', qobj_id='52b4755b-1539-44c2-ba71-e6220402e118', job_id='5cd3f7b2-514c-48a7-a6b4-6439347b6a63', success=True, results=[ExperimentResult(shots=10, success=True, meas_level=2, data=ExperimentResultData(counts={'0x32f659': 1, '0x3c24de': 1, '0x3635fe': 1, '0x156796': 1, '0xcbda9': 1, '0x23339f': 1, '0x321685': 1, '0x23540d': 1, '0x28a50e': 1, '0x22131a': 1}), header=QobjExperimentHeader(clbit_labels=[['meas', 0], ['meas', 1], ['meas', 2], ['meas', 3], ['meas', 4], ['meas', 5], ['meas', 6], ['meas', 7], ['meas', 8], ['meas', 9], ['meas', 10], ['meas', 11], ['meas', 12], ['meas', 13], ['meas', 14], ['meas', 15], ['meas', 16], ['meas', 17], ['meas', 18], ['meas', 19], ['meas', 20], ['meas', 21], ['meas', 22], ['meas', 23]], creg_sizes=[['meas', 24]], global_phase=0.0, memory_slots=24, metadata={}, n_qubits=24, name='circuit-80', qreg_sizes=[['q', 24]], qubit_labels=[['q', 0], ['q', 1], ['q', 2], ['q', 3], ['q', 4], ['q', 5], ['q', 6], ['q', 7], ['q', 8], ['q', 9], ['q', 10], ['q', 11], ['q', 12], ['q', 13], ['q', 14], ['q', 15], ['q', 16], ['q', 17], ['q', 18], ['q', 19], ['q', 20], ['q', 21], ['q', 22], ['q', 23]]), status=DONE, seed_simulator=2887885440, metadata={'noise': 'ideal', 'batched_shots_optimization': False, 'measure_sampling': True, 'parallel_shots': 1, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'num_clbits': 24, 'parallel_state_update': 4, 'sample_measure_time': 0.020044281, 'num_qubits': 24, 'device': 'GPU', 'input_qubit_map': [[23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'cacheblocking': {'max_multiple_chunk_swaps': 7, 'multiple_chunk_swaps_buffer_qubits': 15, 
'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 22, 'enabled': True}, 'fusion': {'applied': True, 'time_taken': 0.000289966, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}}, time_taken=0.057595495)], date=2022-08-25T13:36:34.560273, status=COMPLETED, header=QobjHeader(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0'), metadata={'time_taken': 0.057854789, 'time_taken_execute': 0.057611665, 'mpi_rank': 2, 'num_mpi_processes': 4, 'max_gpu_memory_mb': 22586, 'max_memory_mb': 15815, 'parallel_experiments': 1, 'time_taken_load_qobj': 0.000235604, 'num_processes_per_experiments': 4, 'omp_enabled': True}, time_taken=0.0584716796875)
Result(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0', qobj_id='b7c59ecc-8e6e-4c00-b605-6e562e61193e', job_id='912e3390-5cfd-49ae-ac9d-c61cd2824cf3', success=True, results=[ExperimentResult(shots=10, success=True, meas_level=2, data=ExperimentResultData(counts={'0x32f659': 1, '0x3c24de': 1, '0x3635fe': 1, '0x156796': 1, '0xcbda9': 1, '0x23339f': 1, '0x321685': 1, '0x23540d': 1, '0x28a50e': 1, '0x22131a': 1}), header=QobjExperimentHeader(clbit_labels=[['meas', 0], ['meas', 1], ['meas', 2], ['meas', 3], ['meas', 4], ['meas', 5], ['meas', 6], ['meas', 7], ['meas', 8], ['meas', 9], ['meas', 10], ['meas', 11], ['meas', 12], ['meas', 13], ['meas', 14], ['meas', 15], ['meas', 16], ['meas', 17], ['meas', 18], ['meas', 19], ['meas', 20], ['meas', 21], ['meas', 22], ['meas', 23]], creg_sizes=[['meas', 24]], global_phase=0.0, memory_slots=24, metadata={}, n_qubits=24, name='circuit-80', qreg_sizes=[['q', 24]], qubit_labels=[['q', 0], ['q', 1], ['q', 2], ['q', 3], ['q', 4], ['q', 5], ['q', 6], ['q', 7], ['q', 8], ['q', 9], ['q', 10], ['q', 11], ['q', 12], ['q', 13], ['q', 14], ['q', 15], ['q', 16], ['q', 17], ['q', 18], ['q', 19], ['q', 20], ['q', 21], ['q', 22], ['q', 23]]), status=DONE, seed_simulator=221709455, metadata={'noise': 'ideal', 'batched_shots_optimization': False, 'measure_sampling': True, 'parallel_shots': 1, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'num_clbits': 24, 'parallel_state_update': 4, 'sample_measure_time': 0.017261789, 'num_qubits': 24, 'device': 'GPU', 'input_qubit_map': [[23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'cacheblocking': {'max_multiple_chunk_swaps': 7, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': 
True, 'chunk_parallel_gpus': 1, 'block_bits': 22, 'enabled': True}, 'fusion': {'applied': True, 'time_taken': 0.000284886, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}}, time_taken=0.057612485)], date=2022-08-25T13:36:34.560278, status=COMPLETED, header=QobjHeader(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0'), metadata={'time_taken': 0.057871769, 'time_taken_execute': 0.057631375, 'mpi_rank': 3, 'num_mpi_processes': 4, 'max_gpu_memory_mb': 22586, 'max_memory_mb': 15815, 'parallel_experiments': 1, 'time_taken_load_qobj': 0.000232224, 'num_processes_per_experiments': 4, 'omp_enabled': True}, time_taken=0.05872750282287598)
Result(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0', qobj_id='fbd08923-668b-4292-ac1d-1bfb9a76048b', job_id='4ce25a7f-104c-4b80-9035-6a6d76fe27c1', success=True, results=[ExperimentResult(shots=10, success=True, meas_level=2, data=ExperimentResultData(counts={'0x32f659': 1, '0x3c24de': 1, '0x3635fe': 1, '0x156796': 1, '0xcbda9': 1, '0x23339f': 1, '0x321685': 1, '0x23540d': 1, '0x28a50e': 1, '0x22131a': 1}), header=QobjExperimentHeader(clbit_labels=[['meas', 0], ['meas', 1], ['meas', 2], ['meas', 3], ['meas', 4], ['meas', 5], ['meas', 6], ['meas', 7], ['meas', 8], ['meas', 9], ['meas', 10], ['meas', 11], ['meas', 12], ['meas', 13], ['meas', 14], ['meas', 15], ['meas', 16], ['meas', 17], ['meas', 18], ['meas', 19], ['meas', 20], ['meas', 21], ['meas', 22], ['meas', 23]], creg_sizes=[['meas', 24]], global_phase=0.0, memory_slots=24, metadata={}, n_qubits=24, name='circuit-80', qreg_sizes=[['q', 24]], qubit_labels=[['q', 0], ['q', 1], ['q', 2], ['q', 3], ['q', 4], ['q', 5], ['q', 6], ['q', 7], ['q', 8], ['q', 9], ['q', 10], ['q', 11], ['q', 12], ['q', 13], ['q', 14], ['q', 15], ['q', 16], ['q', 17], ['q', 18], ['q', 19], ['q', 20], ['q', 21], ['q', 22], ['q', 23]]), status=DONE, seed_simulator=2515442787, metadata={'noise': 'ideal', 'batched_shots_optimization': False, 'measure_sampling': True, 'parallel_shots': 1, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'num_clbits': 24, 'parallel_state_update': 4, 'sample_measure_time': 0.002061268, 'num_qubits': 24, 'device': 'GPU', 'input_qubit_map': [[23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'cacheblocking': {'max_multiple_chunk_swaps': 7, 'multiple_chunk_swaps_buffer_qubits': 15, 
'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 22, 'enabled': True}, 'fusion': {'applied': True, 'time_taken': 0.000289585, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}}, time_taken=0.057648494)], date=2022-08-25T13:36:34.559308, status=COMPLETED, header=QobjHeader(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0'), metadata={'time_taken': 0.057937189, 'time_taken_execute': 0.057668424, 'mpi_rank': 0, 'num_mpi_processes': 4, 'max_gpu_memory_mb': 22586, 'max_memory_mb': 15815, 'parallel_experiments': 1, 'time_taken_load_qobj': 0.000259435, 'num_processes_per_experiments': 4, 'omp_enabled': True}, time_taken=0.0599055290222168)
Result(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0', qobj_id='e98c4609-bb58-48c7-b5c4-04aafa73aa62', job_id='5e7fb2e3-9ceb-4683-934f-93cb9fb5f030', success=True, results=[ExperimentResult(shots=10, success=True, meas_level=2, data=ExperimentResultData(counts={'0x32f659': 1, '0x3c24de': 1, '0x3635fe': 1, '0x156796': 1, '0xcbda9': 1, '0x23339f': 1, '0x321685': 1, '0x23540d': 1, '0x28a50e': 1, '0x22131a': 1}), header=QobjExperimentHeader(clbit_labels=[['meas', 0], ['meas', 1], ['meas', 2], ['meas', 3], ['meas', 4], ['meas', 5], ['meas', 6], ['meas', 7], ['meas', 8], ['meas', 9], ['meas', 10], ['meas', 11], ['meas', 12], ['meas', 13], ['meas', 14], ['meas', 15], ['meas', 16], ['meas', 17], ['meas', 18], ['meas', 19], ['meas', 20], ['meas', 21], ['meas', 22], ['meas', 23]], creg_sizes=[['meas', 24]], global_phase=0.0, memory_slots=24, metadata={}, n_qubits=24, name='circuit-80', qreg_sizes=[['q', 24]], qubit_labels=[['q', 0], ['q', 1], ['q', 2], ['q', 3], ['q', 4], ['q', 5], ['q', 6], ['q', 7], ['q', 8], ['q', 9], ['q', 10], ['q', 11], ['q', 12], ['q', 13], ['q', 14], ['q', 15], ['q', 16], ['q', 17], ['q', 18], ['q', 19], ['q', 20], ['q', 21], ['q', 22], ['q', 23]]), status=DONE, seed_simulator=4111680549, metadata={'noise': 'ideal', 'batched_shots_optimization': False, 'measure_sampling': True, 'parallel_shots': 1, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'num_clbits': 24, 'parallel_state_update': 4, 'sample_measure_time': 0.015761308, 'num_qubits': 24, 'device': 'GPU', 'input_qubit_map': [[23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'cacheblocking': {'max_multiple_chunk_swaps': 7, 'multiple_chunk_swaps_buffer_qubits': 15, 
'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 22, 'enabled': True}, 'fusion': {'applied': True, 'time_taken': 0.000274405, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}}, time_taken=0.057645813)], date=2022-08-25T13:36:34.559308, status=COMPLETED, header=QobjHeader(backend_name='aer_simulator_statevector_gpu', backend_version='0.11.0'), metadata={'time_taken': 0.057914288, 'time_taken_execute': 0.057661083, 'mpi_rank': 1, 'num_mpi_processes': 4, 'max_gpu_memory_mb': 22586, 'max_memory_mb': 15815, 'parallel_experiments': 1, 'time_taken_load_qobj': 0.000244824, 'num_processes_per_experiments': 4, 'omp_enabled': True}, time_taken=0.0599062442779541)

Something similar happens with 8, 16, etc. processes.
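
The five tests above are consistent with one simple rule: a run succeeds only when every qubit that receives a gate has an index below `blocking_qubits`. The sketch below encodes that rule and compares it with the reported outcomes. It is a hypothesis fitted to the data in this issue, not a description of Aer's internals; the function name `gates_stay_inside_block` and the `REPORTED` table are made up for illustration.

```python
def gates_stay_inside_block(qubits: int, blank_qubits: int, blocking_qubits: int) -> bool:
    """True when the highest qubit with a Hadamard gate is below blocking_qubits."""
    highest_gate_qubit = qubits - blank_qubits - 1
    return highest_gate_qubit < blocking_qubits


# (qubits, blank_qubits, blocking_qubits, run succeeded?) taken from Tests 1-5
REPORTED = [
    (24, 0, 23, False),  # Test 1
    (24, 1, 23, True),   # Test 2
    (24, 0, 22, False),  # Test 3
    (24, 1, 22, False),  # Test 4
    (24, 2, 22, True),   # Test 5
]

for qubits, blank, blocking, succeeded in REPORTED:
    predicted = gates_stay_inside_block(qubits, blank, blocking)
    print(f"blank={blank} blocking={blocking} predicted_ok={predicted} reported_ok={succeeded}")
```

If the rule holds, the failure is tied to gates acting on qubits outside the block, which is presumably where amplitudes have to move between processes.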

What is the expected behavior?

It should work with all qubits performing operations, without leaving "blank" qubits.

Suggested solutions

I think Qiskit Aer is not managing memory correctly, but I don't know the exact cause of the error.
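
A quick size check helps narrow this down. The sketch below computes the statevector and chunk sizes for the configurations tested above, assuming double-precision amplitudes of 16 bytes each (an assumption; the issue does not state the precision). The helper names are illustrative.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (double precision)


def statevector_bytes(num_qubits: int) -> int:
    """Bytes needed to hold 2**num_qubits complex amplitudes."""
    return (1 << num_qubits) * BYTES_PER_AMPLITUDE


def chunk_layout(num_qubits: int, blocking_qubits: int, num_processes: int) -> dict:
    """Summarize how the state splits into chunks of 2**blocking_qubits amplitudes."""
    num_chunks = 1 << (num_qubits - blocking_qubits)
    return {
        "total_mib": statevector_bytes(num_qubits) // 2**20,
        "chunk_mib": statevector_bytes(blocking_qubits) // 2**20,
        "num_chunks": num_chunks,
        "chunks_per_process": num_chunks // num_processes,
    }


print(chunk_layout(24, 23, 2))  # Tests 1 and 2
print(chunk_layout(24, 22, 4))  # Tests 3 to 5
```

Under that assumption the whole 24-qubit state is 256 MiB, far below the `max_gpu_memory_mb: 22586` reported in the result metadata, so running out of GPU memory looks unlikely. The 67108864 bytes in the Test 1 error message equal 64 MiB, which is half of one 128 MiB chunk in that configuration.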

If you can't replicate this error, please share your hardware setup and installation process. I would really appreciate it!

Thanks a lot for your help!! :)

anavasca commented 1 year ago

I am interested in knowing if the problem has been solved. @doichanj

doichanj commented 1 year ago

I have not been able to reproduce this issue in my environment (Power9 + IBM Spectrum MPI). I think this issue depends on the MPI build and may be related to GPUDirect RDMA. Please check whether your MPI implementation supports RDMA, and try using the MPI options that enable it.
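
For Open MPI, which is the implementation visible in the reporter's stack trace, one way to check whether the build is CUDA-aware is to inspect `ompi_info`. The sketch below assumes `ompi_info` is on `PATH` and prints a `mpi_built_with_cuda_support:value:true` or `...:false` line in `--parsable --all` mode, as described in the Open MPI documentation. The function names are illustrative.

```python
import shutil
import subprocess
from typing import Optional


def parse_cuda_support(ompi_info_output: str) -> Optional[bool]:
    """Return True/False from the mpi_built_with_cuda_support line, or None if absent."""
    for line in ompi_info_output.splitlines():
        if "mpi_built_with_cuda_support:value:" in line:
            return line.strip().rsplit(":", 1)[-1].lower() == "true"
    return None


def openmpi_cuda_support() -> Optional[bool]:
    """Run ompi_info if it is installed; return None when it is unavailable."""
    if shutil.which("ompi_info") is None:
        return None
    proc = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_cuda_support(proc.stdout)


print("CUDA-aware Open MPI:", openmpi_cuda_support())
```

A result of `False` would mean this MPI build cannot accept GPU device pointers directly, which would fit the "bad address" errors in the report. `None` means the check could not be performed on this machine.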

hhorii commented 1 year ago

I am closing this issue because there has been no response for more than two weeks. Please open a new issue if this still needs to be fixed in your environment.