QuantumPackage / qp2

Quantum Package : a programming environment for wave function methods
https://quantumpackage.github.io/qp2/
GNU Affero General Public License v3.0

QP2 singularity running in parallel (OpenMPI) on multiple nodes #306

Closed: bavramidis closed this issue 1 year ago

bavramidis commented 1 year ago

Hello,

I have been able to build multiple QP2 singularities by cloning "https://github.com/QuantumPackage/qp2" with --branch=dev, and these work fine when run on a single node. The code cloned from that repository is parallelized within a node, and I can run on multiple CPUs, which gives a significant speedup for a CIPSI calculation.

I am having trouble, however, running the singularity on multiple nodes. The singularity template I am using, including the .def file, can be found at "https://github.com/Ydrnan/qp2-singularity".

Any advice on the proper way to go about this would be appreciated. I believe the issue is improper communication between MPI and the QP2 singularity.

The following "https://apptainer.org/docs/user/1.0/mpi.html" hybrid singularity on its own works in parallel on multiple nodes on our HPC, How can I implement MPI use on multiple nodes within QP2?

The OpenMPI version that works on our HPC with the Apptainer tutorial linked above is 4.0.5.
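For reference, the hybrid approach from that tutorial uses the host mpirun to start one container instance per rank, along the lines of the sketch below (the image name qp2.sif and the test binary are just placeholders, not from our actual setup):

mpirun -n 8 singularity exec qp2.sif /opt/mpitest/hello_mpi    # host MPI launches one container per rank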

Thank you in advance,

Ben

scemama commented 1 year ago

Hello,

MPI is not required to run QP on multiple nodes, as the communications happen through the ZeroMQ library over TCP sockets.

You first need to run a standard single node calculation:

qp_run fci <EZFIO>

This will be the "master" run. It opens a ZeroMQ socket at an address and port number stored in the <EZFIO>/work/qp_run_address file. It should look like: tcp://192.168.1.91:47279
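For example, once the master is running you can check which address it is listening on (the <EZFIO> directory below is the same placeholder as above):

cat <EZFIO>/work/qp_run_address    # prints something like tcp://192.168.1.91:47279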

On another machine, you can run a "slave" calculation that will connect to the master to accelerate it:

qp_run --slave fci <EZFIO>

If the file system is shared, the slave calculation will read the qp_run_address file to get the address and port number of the master qp_run and attach to it.

You can run as many slaves as you want, and you can start them at any time.
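For example, with a shared file system you could attach an extra slave by hand from a second node; the node name and working directory below are placeholders, and this assumes the QP environment is already set up on that node:

ssh node02 "cd /shared/scratch/my_calc && qp_run --slave fci <EZFIO>" &    # hypothetical second node, shared directory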

If you want to use multiple slaves, then it is worth using MPI for the slave process:

mpirun qp_run --slave fci <EZFIO>

In this mode, only rank zero of the slave will make the ZeroMQ connection to the master, and the common data will be propagated to all the other slaves using an MPI broadcast, which is much faster than doing multiple ZeroMQ communications.

If you look at the qp_srun script, you will see that it does exactly that:

srun -N 1 -n 1 qp_run $PROG $INPUT &    # Runs the master
srun -n $((${SLURM_NTASKS}-1))  \
    qp_run --slave $PROG $INPUT > $INPUT.slaves.out  # Runs N-1 slaves as a single MPI run
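Put together, a minimal Slurm batch script following the same pattern could look like the sketch below; the node counts and the fci/<EZFIO> arguments are placeholders, and the sleep is just a crude way to let the master write its address file before the slaves start (not necessarily what qp_srun itself does):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

srun -N 1 -n 1 qp_run fci <EZFIO> &                                       # master on the first node
sleep 30                                                                  # wait for <EZFIO>/work/qp_run_address to appear
srun -n $((${SLURM_NTASKS}-1)) qp_run --slave fci <EZFIO> > slaves.out    # N-1 slaves as a single MPI run
wait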

Warning: Only the Davidson diagonalization and PT2 selection/perturbation take advantage of multi-node parallelism.

So to answer your question, the only thing you need is to make it possible for the slaves to connect to the master. The simplest way is to put them on the same network. If you can't do that, you can run qp_tunnel instances on each machine on the path between the slave and the master, and the TCP packets will be forwarded from one network to the next.

You can have a look at this presentation to better understand how all this works: https://zenodo.org/records/4321326/preview/JCAD2019AScemama.pdf

Important: there was something wrong on the dev branch. We have created the dev-stable branch, where we kept all the good things from the dev branch and removed the changes that broke backwards compatibility, so the dev branch has been discontinued and will never be merged into master. I suggest that you use dev-stable instead.

bavramidis commented 1 year ago

Hello @scemama ,

Thank you for this information.

Running 'qp_run fci <EZFIO>' first as the master and then 'qp_run --slave fci <EZFIO>' as a separate job in the same directory seems to properly connect the slave to the master. However, shortly after the slave reads the TCP address from the master, I run into a floating point error and the calculation exits with error code 136.

See the output for both master and slave: FCI_Master.txt, FCI_Slave.txt

Once the slave job runs into this floating point error, the calculation terminates. The master job continues to run, but it does not carry the calculation any further, as seen from it eventually being cancelled due to the time limit.

It is worth noting that the slurm file used to submit both of these calculations uses the QP2 singularity but also has the singularity, openmpi, gcc, and libzmq modules loaded. Without the libzmq module loaded in the slurm file, the slave does not connect to the master as it does in the FCI output files above.

See slurm file: slurm.txt
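For context, a stripped-down version of the submission might look like the sketch below; the module names are the ones mentioned above, but the versions, image name, and arguments are placeholders rather than the contents of the actual slurm.txt:

#!/bin/bash
#SBATCH --nodes=1

module load singularity openmpi gcc libzmq
singularity exec qp2.sif qp_run --slave fci <EZFIO>    # slave job; the master job runs qp_run fci <EZFIO> the same way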

My guess is that either 1) I have not properly created the singularity (though it works fine on one node), or 2) I am missing a necessary module in the slurm file.

Thank you and I appreciate any additional feedback,

Ben

scemama commented 1 year ago

Hi @bavramidis, from your outputs, it seems that you are very close! Your master and slave are communicating well: at line 454 of the slave output, it says "selection.00000007", which means that qp_run is running its 7th parallel kernel, a selection step. So the ZeroMQ part is OK.

  1. Can you post the files run1.sh and /qp2/src/fci/IRPF90_temp/cipsi/selection.irp.F90 that are present in the container?
  2. Which configuration file did you use when you ran ./configure -c before compiling QP?
bavramidis commented 1 year ago

Hi @scemama ,

The issue is resolved by using the dev-stable branch, as you suggested earlier.

Thank you for your help!

Ben