Open mdepasca opened 1 year ago
I found MPI issue and I posted PR #1808 I do not know this fix is related to this issue or not, but I could not reproduce the error with this PR.
By the way, in this example, blocking_qubits=5
is not correct because the number of qubit of circuit is 3 that is less than blocking qubits.
If using 16 processes to parallelize the simulation, (number of qubits) - (blocking qubits) should be greater or equal 4.
We performed the suggested changes and re-installed qiskit-aer following the acceptance of PR #1808.
Unfortunately, nothing changed when running our script: we are still receiving Segmentation Fault messages, both from Intel MPI and from Open MPI.
WRT the first version of the script, I updated as follows:
qubits - 4
# ...
# consistent_seed_to_all_processes = 12345
# algorithm_globals.random_seed = consistent_seed_to_all_processes
qubits = 12 blockingQubits = qubits - 4
depth = 5
circuit.measure_all() result = execute( circuit, sim, shots=shots, blocking_enable=True, blocking_qubits=blockingQubits ).result()
and run on 16 processes. This has not helped
I tested with the latest source code of Qiskit Aer, but I could not reproduce segmentation fault with the script with 16 processes / node. I tried changing some build options and parameters in the scripts but it runs correctly. Could you please provide debug trace?
How would you suggest me to produce such debug trace?
Stack trace can be obtained by using gdb
with dumped core file. (by using bt
command after reading core file)
To get stack trace, please add -g
compiler option, by adding one line below in CMakeLists.txt
set(AER_COMPILER_FLAGS "${AER_COMPILER_FLAGS} -g")
Thank you. I understand I should have a core dump file; however that is actually not created by the seg-fault of the MPI ranks. Do you have any suggestion on how to get around this?
Before running the program, set the core file size to unlimited.
ulimit -c unlimited
Then after segv occurs, core file can be loaded to gdb by using coredumpctl
coredumpctl gdb -1
And type bt
to get the trace.
Unfortunately, I can't produce such file on the system I am on. It is an HPC system and sysadmin was very clear about the fact that systemd-coredump is not installed (and likely is not going to be installed, I may add).
Informations
What is the current behavior?
We build qiskit-aer with MPI support (intelMPI) on an HPC system. Currently we are trying to run this simple test script
with the following resource:
What we experience is a Segmentation Fault error from some or all the tasks (the discriminating factor is not clear) at the end of the script, see the partial output below
Steps to reproduce the problem
We built and installed
qiskit-aer
in an Anaconda 3 (2021.05) environment with the following dependencies:cmake 3.21.4
gcc 11.2.0
openblas 0.3.18
build withgcc 11
intelMPI 3.1
build bygcc
mpi4py
Python modulepy-pybind11
Python moduleand running
Then we run the script as follow
What is the expected behavior?
We expect the script to end with no Segmentation Fault errors.
Suggested solutions
None so far