Encountering "MPI_ERR_COUNT: invalid count argument" when creating GHZ states on multiple nodes

intelligi123 commented 9 months ago

Information

Qiskit Aer version: 0.14.0
Python version: 3.11.6
Operating system: Ubuntu 23.10

What is the current behavior?

I am running code that creates a 30-qubit GHZ state with the statevector simulator. On a single node it fails with an insufficient-memory error:

qiskit.exceptions.QiskitError: 'ERROR: [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M , ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M'

I then added a second node and ran the script on two nodes, but it failed with the MPI error in the title.
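For reference, the 16384M figure in the error matches the size of a dense statevector that stores 2^n amplitudes at 16 bytes each. The sketch below only reproduces that arithmetic. The helper name `statevector_mib` is hypothetical, and the 16 bytes per amplitude (double-precision complex) is an assumption that happens to agree with the number reported in the error.

```python
def statevector_mib(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory in MiB for a dense statevector (assumes complex128 amplitudes)."""
    return (2 ** n_qubits) * bytes_per_amplitude // (1024 ** 2)


required = statevector_mib(30)
available = 15903  # "max memory" value reported in the error message

# Compare the requirement against what one node reports as available.
print(f"required={required} MiB, available={available} MiB, "
      f"too large for one node: {required > available}")
```

Under this assumption, 29 qubits would need half as much memory and 31 qubits twice as much, since each extra qubit doubles the number of amplitudes.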

command:

mpirun -np 2 -machinefile machinefile.txt python3 ghz.py

Error:

[dell-Precision-Tower-5810:24773] *** An error occurred in MPI_Irecv
[dell-Precision-Tower-5810:24773] *** reported by process [3164471297,0]
[dell-Precision-Tower-5810:24773] *** on communicator MPI_COMM_WORLD
[dell-Precision-Tower-5810:24773] *** MPI_ERR_COUNT: invalid count argument
[dell-Precision-Tower-5810:24773] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell-Precision-Tower-5810:24773] ***    and potentially your MPI job)
[dell-5810:03630] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[dell-Precision-Tower-5810:24768] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dell-Precision-Tower-5810:24768] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Here is the code

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

def create_ghz_circuit(n_qubits):
    circuit = QuantumCircuit(n_qubits)
    circuit.h(0)
    for qubit in range(n_qubits - 1):
        circuit.cx(qubit, qubit + 1)
    return circuit

n_qubits=30
simulator = AerSimulator(method='statevector',device='CPU',blocking_enable=True, blocking_qubits=n_qubits-2)
circuit = create_ghz_circuit(n_qubits)
print(circuit.num_qubits)
circuit.measure_all()
job = simulator.run(circuit)
result = job.result()

Steps to reproduce the problem

Running the code with mpirun, as in the command above, generates the error.

What is the expected behavior?

The insufficient-memory issue should be resolved, and the code should be able to simulate the GHZ state.

Suggested solutions

The error is raised in MPI_Irecv, and "MPI_ERR_COUNT: invalid count argument" suggests that there is some mismatch in the type of the count argument.

doichanj commented 8 months ago

Could you try running with fewer qubits on 2 nodes, and also with fewer qubits on a single node with multiple processes?

intelligi123 commented 8 months ago

I selected 28 qubits. The code is the same except that I added algorithm_globals.random_seed = 1000:

Here is the code:

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

from qiskit_algorithms.utils import algorithm_globals
algorithm_globals.random_seed = 1000

def create_ghz_circuit(n_qubits):
    circuit = QuantumCircuit(n_qubits)
    circuit.h(0)
    for qubit in range(n_qubits - 1):
        circuit.cx(qubit, qubit + 1)
    return circuit

n_qubits=28
simulator = AerSimulator(method='statevector',seed_simulator = algorithm_globals.random_seed, device='GPU',blocking_enable=True, blocking_qubits=n_qubits-2)
circuit = create_ghz_circuit(n_qubits)
print(circuit.num_qubits)
circuit.measure_all()
job = simulator.run(circuit)
result = job.result()
print(result)

For the case of two nodes, I got the full result variable as follows:

mpirun -np 2 -machinefile machinefile.txt python3 ghz.py

Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='a7e6782f-e971-4fbc-9503-1395c1bcec4f', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.778112021, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.051840722, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000371272, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.778112021)], date=2024-03-13T10:09:14.735699, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.816238386, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.5836e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=190.94678616523743)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='0e0b0850-a5ef-404d-9dd4-bb2546c3cf68', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-158', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.769095649, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.062222302, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000387979, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.769095649)], date=2024-03-13T10:09:14.723119, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.806562321, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.7389e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=193.54268836975098)

Queries: I expected the simulator to share resources and distribute the statevector across the two memory spaces, but from the results it looks like two independent circuits are running, one on each node, which is not what I want.
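One thing worth checking here is the metadata dictionary printed at the end of each Result. Under mpirun every rank executes the whole script, so two printed Result objects do not by themselves show that two independent simulations ran. The sketch below pulls out the MPI-related keys that appear in the output above. The helper name `summarize_mpi_metadata` is hypothetical, and reading `num_processes_per_experiments > 1` as "the ranks cooperate on one circuit" is an assumption, not something confirmed in this thread.

```python
def summarize_mpi_metadata(metadata: dict) -> dict:
    """Pick out the MPI-related fields from a Result's top-level metadata."""
    n_total = metadata.get("num_mpi_processes", 1)
    n_per_experiment = metadata.get("num_processes_per_experiments", 1)
    return {
        "rank": metadata.get("mpi_rank", 0),
        "total_processes": n_total,
        "processes_per_experiment": n_per_experiment,
        # Assumption: more than one process per experiment means the ranks
        # share one circuit instead of each simulating a full copy.
        "looks_distributed": n_per_experiment > 1,
    }


# Values copied from the rank-0 output in this comment.
meta = {
    "mpi_rank": 0,
    "num_mpi_processes": 2,
    "num_processes_per_experiments": 2,
}
summary = summarize_mpi_metadata(meta)

# Print from rank 0 only, so the report appears once rather than once per rank.
if summary["rank"] == 0:
    print(summary)
```

Guarding the final `print(result)` in the script with the same rank check would keep the output to a single copy.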

For multiple processes on a single node: when I ran the above code, it generated this error:

std::bad_alloc: cudaErrorMemoryAllocation: out of memory

It worked fine when I ran it with the device set to CPU:

Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='17b7879c-e5b3-4fbf-bb1e-5ef2addb93c7', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.796454583, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.490546031, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000383349, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.796454583)], date=2024-03-13T10:11:57.032453, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.965566354, 'mpi_rank': 0, 'time_taken_parameter_binding': 4.7416e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.966766595840454)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='c33d9971-88e0-44d6-ade9-219e08795d3e', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.79647343, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.472537557, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.00035926, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.79647343)], date=2024-03-13T10:11:57.034494, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.96762756, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.3155e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.96878981590271)

I then increased the number of qubits to 31, with the device set to CPU, and ran on two nodes. It generated this error:

Simulation failed and returned the following error message:
ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='620aee11-405f-486d-8c1c-1dfae26aeb32', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.585262, status=ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.011740267, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.0978e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.023772716522216797)
Simulation failed and returned the following error message:
ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='6fc632cc-f5ba-4373-977f-d8dd20980c6b', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.535773, status=ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.013288266, 'mpi_rank': 1, 'time_taken_parameter_binding': 5.1933e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.031948089599609375)

Queries:

Here the required memory is 16384M, and the two nodes together provide 15903 + 15903 = 31806M, which would be sufficient for the circuit if the resources were shared. But because it is running as two independent circuits, it generates the error.
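The figures in the output admit another reading, offered here as an assumption rather than a confirmed description of how Aer accounts for memory. The failed run's metadata lists 'required_memory_mb': 32768 for 31 qubits, so the 16384M in the error message could be the share of each of the two processes, which still exceeds the 15903M available per node. The sketch below works through that arithmetic. It assumes 16 bytes per amplitude and an even split across processes, ignores any exchange buffers, and uses hypothetical helper names.

```python
def per_process_mib(n_qubits: int, n_processes: int) -> float:
    """MiB held by each process if a complex128 statevector is split evenly."""
    return (2 ** n_qubits) * 16 / n_processes / (1024 ** 2)


def max_qubits(mem_mib_per_process: int, n_processes: int) -> int:
    """Largest qubit count whose even share fits in each process's memory."""
    n = 0
    while per_process_mib(n + 1, n_processes) <= mem_mib_per_process:
        n += 1
    return n


# Share per process for the 31-qubit run on two nodes.
print(per_process_mib(31, 2))

# Largest circuit that fits with 15903 MiB per process, for 1, 2 and 4 processes.
for procs in (1, 2, 4):
    print(procs, max_qubits(15903, procs))
```

Under these assumptions, two nodes of this size would top out at 30 qubits, and 31 qubits would need four such processes.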

A similar error is generated when I run with device=GPU, only now it comes from CUDA:

std::bad_alloc: cudaErrorMemoryAllocation: out of memory

So the main problem is that my circuit is not being run with the statevector distributed and the resources shared. How can I achieve this?

intelligi123 commented 7 months ago

Hi @doichanj, is there any update on this issue?

By the way, I asked this question on the Open MPI issue tracker, and according to their response this is some sort of type error:

size_t instead of an int to call MPI_Irecv.
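In the C bindings, the count argument of MPI_Irecv is an int, so a single call cannot describe more elements than a C int can hold. The sketch below checks whether one chunk of 2^blocking_qubits amplitudes would fit if the transfer were counted in bytes. Whether Aer counts in bytes or in larger elements is not stated in this thread, so the byte-count case is an assumption, as are the 32-bit int, the 16 bytes per amplitude, and the helper names.

```python
INT_MAX = 2 ** 31 - 1  # largest value of a 32-bit signed C int (assumed width)


def chunk_bytes(blocking_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Size in bytes of one chunk holding 2**blocking_qubits amplitudes."""
    return (2 ** blocking_qubits) * bytes_per_amplitude


def fits_in_int_count(count: int) -> bool:
    """True if the value can be passed as a non-negative C int count."""
    return 0 <= count <= INT_MAX


for n_qubits in (28, 30):
    blocking = n_qubits - 2  # same setting as the scripts in this thread
    size = chunk_bytes(blocking)
    print(n_qubits, size, fits_in_int_count(size))
```

Under these assumptions the 28-qubit chunk fits and the 30-qubit chunk does not, which would line up with the 28-qubit run completing and the 30-qubit run aborting in MPI_Irecv.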

Can you please suggest what I can do to resolve this, or do I need to wait for a patch?

Just to make one thing clear: if my circuit needs a total of 16G of RAM, will launching two MPI processes on two nodes (one each) divide the required resources (8G on each node) or not? In my case both nodes are using 16G of RAM, as if two independent processes (statevectors) were running rather than one distributed statevector.

Guogggg commented 2 weeks ago

Is there any update on this issue? I ran into the same problem using Intel MPI when running the command mpirun -np 2 -machinefile hostfile python example.py

Qiskit / qiskit-aer