intelligi123 opened this issue 9 months ago
Could you try running with fewer qubits on 2 nodes, and also with fewer qubits on a single node with multiple processes?
I selected 28 qubits. The code is the same, except that I added algorithm_globals.random_seed = 1000.
Here is the code:
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator
from qiskit_algorithms.utils import algorithm_globals

algorithm_globals.random_seed = 1000

def create_ghz_circuit(n_qubits):
    # Hadamard on qubit 0, then a CNOT chain that entangles every qubit with its neighbour
    circuit = QuantumCircuit(n_qubits)
    circuit.h(0)
    for qubit in range(n_qubits - 1):
        circuit.cx(qubit, qubit + 1)
    return circuit

n_qubits = 28
# blocking_enable turns on chunk-based parallelization across multiple GPUs or MPI processes;
# blocking_qubits sets the chunk size to 2**blocking_qubits amplitudes
simulator = AerSimulator(method='statevector', seed_simulator=algorithm_globals.random_seed, device='GPU', blocking_enable=True, blocking_qubits=n_qubits - 2)

circuit = create_ghz_circuit(n_qubits)
print(circuit.num_qubits)
circuit.measure_all()
job = simulator.run(circuit)
result = job.result()
print(result)
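The script uses blocking_qubits=n_qubits - 2. One way to sanity-check that choice is a sketch that assumes the AerSimulator documentation's guidance that a chunk of 16 * 2**blocking_qubits bytes should stay below roughly a quarter of the GPU memory (verify this against the docs for your Aer version). max_blocking_qubits is a hypothetical helper; capping the result at the qubits each MPI process holds locally is an additional assumption, not documented Aer behavior.

```python
import math


def max_blocking_qubits(device_memory_mb: int, n_qubits: int,
                        n_processes: int = 1, fraction: float = 0.25) -> int:
    """Hypothetical helper: largest b with 16 * 2**b bytes <= fraction * device memory.

    Assumptions: complex128 amplitudes (16 bytes each), the 'fraction' rule of
    thumb, a power-of-two process count, and that a chunk should not exceed
    the number of qubits held locally by one process.
    """
    budget_bytes = fraction * device_memory_mb * 1024 * 1024
    by_memory = math.floor(math.log2(budget_bytes / 16))
    local_qubits = n_qubits - int(math.log2(n_processes))
    return min(by_memory, local_qubits)


# max_gpu_memory_mb reported in the logs, 28 qubits, 2 MPI processes
print(max_blocking_qubits(5933, 28, 2))
```

Under these assumptions, the 5933M of GPU memory reported in the logs gives 26 for 28 qubits on 2 processes, the same value as n_qubits - 2 (and the block_bits reported under cacheblocking).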
For the two-node case, I got the following full result output:
mpirun -np 2 -machinefile machinefile.txt python3 ghz.py
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='a7e6782f-e971-4fbc-9503-1395c1bcec4f', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.778112021, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.051840722, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000371272, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.778112021)], date=2024-03-13T10:09:14.735699, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.816238386, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.5836e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=190.94678616523743)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='0e0b0850-a5ef-404d-9dd4-bb2546c3cf68', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-158', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.769095649, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.062222302, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000387979, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.769095649)], date=2024-03-13T10:09:14.723119, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.806562321, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.7389e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=193.54268836975098)
Question: I expected the simulator to share resources and distribute the statevector across the two memory spaces, but from the results it looks like two independent circuits are running, one on each node, which is not what I want.
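Each process started by mpirun runs its own copy of the script, so each rank calls print(result) and gets its own Result object with its own job_id; two printed results are expected either way. Whether the statevector itself was split can be read from the top-level Result.metadata shown above. A minimal sketch, assuming (not confirmed here against the Aer documentation) that num_processes_per_experiments greater than 1 means one experiment was shared by that many processes; describe_mpi_run is a hypothetical helper, and the dictionary copies keys from the rank-0 output above.

```python
def describe_mpi_run(metadata: dict) -> str:
    """Hypothetical helper: summarize how one experiment was spread over MPI processes.

    Assumption: num_processes_per_experiments > 1 is read as the experiment's
    statevector being split across that many MPI processes.
    """
    rank = metadata.get("mpi_rank", 0)
    n_procs = metadata.get("num_mpi_processes", 1)
    per_experiment = metadata.get("num_processes_per_experiments", 1)
    if per_experiment > 1:
        return f"rank {rank} of {n_procs}: one experiment shared by {per_experiment} processes"
    return f"rank {rank} of {n_procs}: experiment run on this process alone"


# Keys copied from the rank-0 Result.metadata in the two-node output above
rank0_metadata = {
    "mpi_rank": 0,
    "num_mpi_processes": 2,
    "num_processes_per_experiments": 2,
    "parallel_experiments": 1,
}
print(describe_mpi_run(rank0_metadata))
# prints: rank 0 of 2: one experiment shared by 2 processes
```

In the script itself, printing only when result.metadata.get('mpi_rank', 0) == 0 would avoid the duplicated output.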
For multiple processes on a single node: running the above code failed with this error:
std::bad_alloc: cudaErrorMemoryAllocation: out of memory
It worked fine when I set device='CPU':
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='17b7879c-e5b3-4fbf-bb1e-5ef2addb93c7', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.796454583, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.490546031, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000383349, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.796454583)], date=2024-03-13T10:11:57.032453, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.965566354, 'mpi_rank': 0, 'time_taken_parameter_binding': 4.7416e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.966766595840454)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='c33d9971-88e0-44d6-ade9-219e08795d3e', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.79647343, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.472537557, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.00035926, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.79647343)], date=2024-03-13T10:11:57.034494, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.96762756, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.3155e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.96878981590271)
I then increased the number of qubits to 31 with device='CPU' and ran on two nodes, which produced this error:
Simulation failed and returned the following error message:
ERROR: [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='620aee11-405f-486d-8c1c-1dfae26aeb32', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.585262, status=ERROR: [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.011740267, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.0978e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.023772716522216797)
Simulation failed and returned the following error message:
ERROR: [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='6fc632cc-f5ba-4373-977f-d8dd20980c6b', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.535773, status=ERROR: [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.013288266, 'mpi_rank': 1, 'time_taken_parameter_binding': 5.1933e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.031948089599609375)
Questions:
The required memory here is 16384M, and the two nodes together have 15903M + 15903M = 31806M.
That would be sufficient for the circuit if the resources were shared, but because it runs as two independent circuits, it fails with this error.
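The numbers in these logs can be checked by hand: an n-qubit statevector holds 2**n complex amplitudes at 16 bytes each in double precision, i.e. 2**n * 16 bytes. A small sketch of that arithmetic; the helper names are illustrative, an even split across processes is assumed, and any extra working buffers the simulator may allocate are ignored.

```python
def statevector_memory_mb(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Full statevector size in MiB, assuming complex128 amplitudes (16 bytes each)."""
    return (2 ** n_qubits) * bytes_per_amplitude // (1024 * 1024)


def per_process_memory_mb(n_qubits: int, n_processes: int) -> int:
    """Each process's share if the statevector is split evenly across processes."""
    return statevector_memory_mb(n_qubits) // n_processes


max_memory_mb = 15903  # per-node limit reported in the logs

for n_qubits, n_processes in [(28, 2), (30, 1), (31, 2)]:
    total = statevector_memory_mb(n_qubits)
    share = per_process_memory_mb(n_qubits, n_processes)
    print(f"{n_qubits} qubits on {n_processes} process(es): "
          f"total {total}M, per process {share}M, fits: {share <= max_memory_mb}")
```

Under these assumptions, 28 qubits gives 4096M and 31 qubits gives 32768M, matching the required_memory_mb values in the metadata, while the 16384M in the error message is exactly half of 32768M. That may indicate the 16384M is already each process's share after distribution, which would still be slightly above the 15903M available per node.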
A similar error occurs when I run with device='GPU', only now it comes from CUDA:
std::bad_alloc: cudaErrorMemoryAllocation: out of memory
So the main problem is that my circuit does not run with the statevector distributed and the resources shared across nodes. How can I achieve this?
Hi @doichanj, is there any update on this issue?
By the way, I asked this question on the Open MPI issue tracker, and according to their response this is a type error: a size_t is used instead of an int when calling MPI_Irecv.
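To illustrate why a size_t count can become invalid once it reaches an int parameter (the count argument of MPI_Irecv is an int in the C bindings): a value above INT_MAX wraps around when converted to a 32-bit int on typical two's-complement platforms. A sketch that uses Python's ctypes to mimic that conversion; the counts are illustrative, not values taken from Aer's code.

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value of a 32-bit C int


def as_c_int(count: int) -> int:
    """Mimic converting a size_t count to a 32-bit C int (wraps modulo 2**32)."""
    return ctypes.c_int(count).value


for count in [2**30, 2**31, 2**32 + 5]:
    print(count, "->", as_c_int(count))
# 1073741824 -> 1073741824
# 2147483648 -> -2147483648
# 4294967301 -> 5
```

A negative count is the kind of value MPI rejects with MPI_ERR_COUNT, while a wrapped positive count would silently transfer the wrong amount of data; which of these happens inside Aer is not confirmed here.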
Can you suggest what I can do to resolve this, or do I need to wait for a patch?
Just to make one thing clear: if my circuit needs a total of 16G of RAM, will launching two MPI processes on two nodes (one per node) divide the required resources (8G on each node)? In my case, both nodes use 16G of RAM, as if two independent processes (statevectors) were running rather than one distributed statevector.
Is there any update on this issue? I ran into the same problem using Intel MPI with the command mpirun -np 2 -machinefile hostfile python example.py
Information
What is the current behavior?
I am running code that creates a 30-qubit GHZ state with the statevector simulator, and it fails with an insufficient-memory error:
qiskit.exceptions.QiskitError: 'ERROR: [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M , ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M'
I added a node and ran the script on two nodes, but it produced the error above. Command:
mpirun -np 2 -machinefile machinefile.txt python3 ghz.py
Error:
Here is the code
Steps to reproduce the problem
Running the code with mpirun produces the error.
What is the expected behavior?
The insufficient-memory issue should be resolved, and the code should be able to simulate the GHZ state.
Suggested solutions
The error comes from the MPI_Irecv call, and the message MPI_ERR_COUNT: invalid count argument suggests a mismatch in the argument type.
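A common generic workaround for int-sized count limits (not necessarily what Aer does or will do) is to split one large transfer into several messages whose counts each fit in an int. A minimal sketch of only the splitting logic; split_into_int_counts is a hypothetical helper, and in real C++ code each (offset, count) pair would correspond to one receive or send call.

```python
INT_MAX = 2**31 - 1  # largest count a 32-bit C int can hold


def split_into_int_counts(total: int, max_count: int = INT_MAX) -> list[tuple[int, int]]:
    """Split a transfer of `total` elements into (offset, count) pieces with count <= max_count."""
    if total < 0:
        raise ValueError("total must be non-negative")
    pieces = []
    offset = 0
    while offset < total:
        count = min(max_count, total - offset)
        pieces.append((offset, count))
        offset += count
    return pieces


print(split_into_int_counts(10, max_count=4))
# prints: [(0, 4), (4, 4), (8, 2)]
```

For example, a hypothetical transfer of 2**34 elements would become nine messages: eight of INT_MAX elements and a final one of 8.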