Qiskit / qiskit-aer

Aer is a high performance simulator for quantum circuits that includes noise models
https://qiskit.github.io/qiskit-aer/
Apache License 2.0

MPI segmentation fault with simple circuit on 1 node #1788

Open mdepasca opened 1 year ago

mdepasca commented 1 year ago

Information

What is the current behavior?

We built qiskit-aer with MPI support (Intel MPI) on an HPC system. We are currently trying to run this simple test script:

from qiskit import execute, transpile
from qiskit.circuit.library import QuantumVolume
from qiskit.providers.aer import AerSimulator
from qiskit.utils import algorithm_globals

consistent_seed_to_all_processes = 12345
algorithm_globals.random_seed = consistent_seed_to_all_processes

sim = AerSimulator(method='statevector', device='CPU', blocking_qubits=5)

shots = 100
depth = 3
qubits = 3
circuit = transpile(QuantumVolume(qubits, depth, seed=2),
                    backend=sim,
                    optimization_level=0)

print(circuit)

circuit.measure_all()
result = execute(circuit, sim, shots=shots,
                 blocking_enable=True, blocking_qubits=5).result()

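# Use a name that does not shadow the built-in dict.
result_dict = result.to_dict()
print(result_dict.keys())
# Each process prints the MPI rank reported in the result metadata.
meta = result_dict['metadata']
myrank = meta['mpi_rank']
print(myrank)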

with the following resource:

Some or all of the tasks fail with a segmentation fault at the end of the script (it is not clear what determines which tasks fail). See the partial output below:

[...]
     ┌──────────┐┌──────────┐┌──────────┐
q_0: ┤0         ├┤0         ├┤0         ├
     │  su4_837 ││          ││          │
q_1: ┤1         ├┤  su4_262 ├┤  su4_110 ├
     └──────────┘│          ││          │
q_2: ────────────┤1         ├┤1         ├
                 └──────────┘└──────────┘
dict_keys(['backend_name', 'backend_version', 'date', 'header', 'qobj_id', 'job_id', 'status', 'success', 'results', 'metadata', 'time_taken'])
13
[...]
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 51068 RUNNING AT i23r02c05s12
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
[...]
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 14 PID 51082 RUNNING AT i23r02c05s12
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

Steps to reproduce the problem

We built and installed qiskit-aer in an Anaconda 3 (2021.05) environment with the following dependencies:

by running

python ./setup.py bdist_wheel -- -DAER_MPI=True -DBUILD_TESTS=True
pip install dist/qiskit_aer-*.whl

Then we run the script as follows:

srun -n 16 python 1_mpi_test_CPU.py

What is the expected behavior?

We expect the script to end with no Segmentation Fault errors.

Suggested solutions

None so far

doichanj commented 1 year ago

I found an MPI issue and posted PR #1808. I do not know whether that fix is related to this issue, but I could not reproduce the error with the PR applied.

By the way, blocking_qubits=5 is not correct in this example, because the circuit has only 3 qubits, which is fewer than the number of blocking qubits. When 16 processes are used to parallelize the simulation, (number of qubits) - (blocking qubits) should be greater than or equal to 4.
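The rule above can be sketched as a small check. This is only an illustration: the helper name max_blocking_qubits is hypothetical, and generalizing the "4" for 16 processes to log2(number of processes) is an assumption (16 = 2^4), not something stated in the comment.

```python
def max_blocking_qubits(num_qubits: int, num_processes: int) -> int:
    """Return the largest blocking_qubits value allowed by the quoted rule.

    Hypothetical helper. Assumes the rule generalizes to
    num_qubits - blocking_qubits >= log2(num_processes)
    and that num_processes is a power of two.
    """
    if num_processes < 1 or num_processes & (num_processes - 1):
        raise ValueError("num_processes must be a power of two")
    # For a power of two, bit_length() - 1 is exactly log2 (16 -> 4).
    process_bits = num_processes.bit_length() - 1
    limit = num_qubits - process_bits
    if limit < 1:
        raise ValueError(
            f"{num_qubits} qubits are too few for {num_processes} processes"
        )
    return limit


if __name__ == "__main__":
    print(max_blocking_qubits(12, 16))
```

With 12 qubits and 16 processes the helper returns 8. With the 3-qubit circuit from the original script it raises ValueError, because no positive blocking_qubits value satisfies the rule.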

mdepasca commented 1 year ago

We performed the suggested changes and re-installed qiskit-aer following the acceptance of PR #1808.

Unfortunately, nothing changed when running our script: we are still receiving Segmentation Fault messages, both from Intel MPI and from Open MPI.

mdepasca commented 1 year ago

With respect to the first version of the script, I made the following changes:

qubits = 12
blockingQubits = qubits - 4

...

depth = 5

...

circuit.measure_all()
result = execute(circuit, sim, shots=shots,
                 blocking_enable=True, blocking_qubits=blockingQubits).result()

and ran it on 16 processes. This did not help.

doichanj commented 1 year ago

I tested with the latest source code of Qiskit Aer, but I could not reproduce the segmentation fault with the script using 16 processes per node. I tried changing some build options and some parameters in the script, but it runs correctly. Could you please provide a debug trace?

mdepasca commented 1 year ago

How would you suggest I produce such a debug trace?

doichanj commented 1 year ago

A stack trace can be obtained by loading the dumped core file into gdb and then running the bt command. To get a useful stack trace, please add the -g compiler option by adding the line below to CMakeLists.txt:

set(AER_COMPILER_FLAGS "${AER_COMPILER_FLAGS} -g")

mdepasca commented 1 year ago

Thank you. I understand I should have a core dump file; however, no such file is actually created when the MPI ranks segfault. Do you have any suggestions on how to get around this?

doichanj commented 1 year ago

Before running the program, set the core file size limit to unlimited:

ulimit -c unlimited

Then, after the segfault occurs, the core file can be loaded into gdb using coredumpctl:

coredumpctl gdb -1

Then type bt to get the trace.
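As a possible alternative to setting the limit in the shell, the same limit can be raised from inside the Python script with the standard resource module, so that each rank applies it to itself. This is a minimal sketch: the function name enable_core_dumps is hypothetical, and whether and where a core file is actually written still depends on how the system is configured.

```python
import resource


def enable_core_dumps() -> tuple:
    """Raise the soft core-file size limit of this process to its hard limit.

    Hypothetical helper. It cannot exceed the hard limit set by the system,
    so if the hard limit is 0 no core file will be written.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    # Return the limits now in effect so the caller can log them.
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    print(enable_core_dumps())
```

Calling enable_core_dumps() at the top of the test script, before the simulation runs, would apply the limit in every rank. If a core file is produced as a plain file, it can be opened with gdb by passing the Python executable and the core file as arguments, followed by bt.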

mdepasca commented 1 year ago

Unfortunately, I can't produce such a file on the system I am on. It is an HPC system, and the sysadmin was very clear that systemd-coredump is not installed (and, I may add, is unlikely to be installed).
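When no core file can be produced, one partial fallback is Python's standard faulthandler module, which writes the Python-level traceback when the process receives SIGSEGV. The sketch below is an assumption-laden illustration, not a fix suggested in this thread: the function name enable_fault_traces and the log file naming are hypothetical, and it assumes srun sets the SLURM_PROCID environment variable for each task. It only shows which Python call was active, not the C++ stack inside Aer.

```python
import faulthandler
import os

# Keep the log file open for the lifetime of the process; faulthandler
# writes to its file descriptor when a fatal signal arrives.
_fault_log = None


def enable_fault_traces(log_dir: str = ".") -> str:
    """Write Python tracebacks on fatal signals to a per-process log file.

    Hypothetical helper. Uses SLURM_PROCID (assumed to be set by srun)
    to name the file, falling back to the process ID.
    """
    global _fault_log
    rank = os.environ.get("SLURM_PROCID", str(os.getpid()))
    path = os.path.join(log_dir, f"fault_rank_{rank}.log")
    _fault_log = open(path, "w")
    faulthandler.enable(file=_fault_log, all_threads=True)
    return path


if __name__ == "__main__":
    print(enable_fault_traces())
```

Calling enable_fault_traces() at the top of the test script would leave one log file per task, which at least shows whether the crash happens during execute(...).result() or during interpreter shutdown.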